Abstract The goal of high-level event recognition is to
automatically detect complex high-level events in a given
video sequence. This is a difficult task, especially when
videos are captured under unconstrained conditions by
non-professionals. Such videos depicting complex events have
limited quality control and may therefore include severe
camera motion, poor lighting, heavy background clutter, and
occlusion. However, due to the fast-growing popularity of
such videos, especially on the Web, solutions to this problem
are in high demand and have attracted great interest from
researchers. In this paper, we review current technologies for
complex event recognition in unconstrained videos. While
the existing solutions vary, we identify common key modules
and provide detailed descriptions along with some insights
for each of them, including extraction and representation of
low-level features across different modalities, classification
strategies, and fusion techniques. Publicly available
benchmark datasets, performance metrics, and related research
forums are also described. Finally, we discuss promising
directions for future research.
1 Introduction

High-level  video  event  recognition  is  the  process  of  automat¬ 
ically  identifying  video  clips  that  contain  events  of  interest. 
The  high-level  or  complex  events — by  our  definition — are 
long-term  spatially  and  temporally  dynamic  object  interac¬ 
tions  that  happen  under  certain  scene  settings.  Two  popu¬ 
lar  categories  of  complex  events  are  instructional  and  social 
events.  The  former  includes  procedural  videos  (e.g.,  “mak¬ 
ing  a  cake”,  “changing  a  vehicle  tire”),  while  the  latter 
includes  social  activities  (e.g.,  “birthday  party”,  “parade”, 
“flash  mob”).  Techniques  for  recognizing  such  high-level 
events  are  essential  for  many  practical  applications  such  as 
Web  video  search,  consumer  video  management,  and  smart 
advertising. 

The focus of this work is to address the issues related
to high-level event recognition. Events, actions, interactions,
activities, and behaviors have been used interchangeably in
the literature [1,15], and there is no agreement on the precise
definition of each term. In this paper, we attempt to provide
a hierarchical model for complex event recognition in
Fig. 1. Movement is the lowest level description: “an entity
(e.g., hand) is moved with large displacement in right
direction with slow speed”. Movements can also be referred to as
attributes, which have recently been used in human action
recognition [73] following their successful use in face
recognition in a single image. Next are activities or actions, which
are sequences of movements (e.g., “hand moving to right
followed by hand moving to left”, which is a “waving” action).
An action has a more meaningful interpretation and is often
performed by entities (e.g., human, animal, and vehicle). An
action can also be performed between two or more entities,
which is commonly referred to as an interaction (e.g., person
lifts an object, person kisses another person, car enters
facility, etc.). Motion verbs can also be used to describe
interactions. The Mind’s Eye dataset, recently released under a
DARPA program, contains many motion verbs such as
“approach”, “lift”, etc. [11]. In this hierarchy, concepts span
across both actions and interactions. In general, concept is
a loaded word, which has been used to represent objects,
scenes, and events, such as those defined in the large-scale
concept ontology for multimedia (LSCOM) [95]. Finally, at the
top level of the hierarchy, we have complex or high-level
events that have larger temporal durations and consist of a
sequence of interactions or stand-alone actions. For example,
the event “changing a vehicle tire” contains a sequence of
interactions such as “person opening trunk” and “person using
wrench”, followed by actions such as “squatting” and so on.
Similarly, another complex event such as “birthday party” may
involve actions like “person clapping” and “person singing”,
followed by interactions like “person blowing candle” and
“person cutting cake”. Note that although we have attempted
to encapsulate most semantic components of complex events
in a single hierarchy, because of the polysemous nature of the
words, adopting the same terminologies across the research
community is an impossible objective to achieve.

Fig. 1 A taxonomy of semantic categories in videos, with increased
complexity from bottom to top. Attributes are basic components (e.g.,
movements) of actions, while actions are key elements of interactions.
High-level events (focus of this paper) lie at the top of the hierarchy
and contain (normally multiple) complex actions and interactions
evolving over time

Having said that, we set the context of event recognition as
the detection of temporal and spatial locations of the complex
event in the video sequence. In a simplified case, when
temporal segmentation of video into clips has been achieved, or
where each video contains only one event and precise spatial
localization is not important, it reduces to a video
classification problem.

While  many  existing  works  have  only  employed  the  visual 
modality  for  event  recognition,  it  is  important  to  empha¬ 
size  that  video  analysis  is  intrinsically  multimodal,  demand¬ 
ing  multidisciplinary  knowledge  and  tools  from  many  fields, 
such  as  computer  vision,  audio  and  speech  analysis,  multime¬ 
dia,  and  machine  learning.  To  deal  with  large  scale  data  that 
is  common  nowadays,  scalable  indexing  methods  and  paral¬ 
lel  computational  platforms  are  also  becoming  an  important 
part  of  modern  video  analysis  systems. 

There exist many challenges in developing automatic
video event recognition systems. One well-known challenge
is the long-standing semantic gap between computable
low-level features (e.g., visual, audio, and textual features) and
the semantic information that they encode (e.g., the presence of
meaningful classes such as “a person clapping”, “sound of a
crowd cheering”, etc.) [129]. Current approaches heavily rely
on classifier-based methods employing directly computable
features. In other words, these classifiers attempt to establish
a correspondence between the computed features, or a
quantized layer on the features, and the actual label of the event
depicted in the video. In doing so, they lack a semantically
meaningful, yet conceptually abstract, intermediate
representation of the complex event, which can be used to explain
what a particular event is, and how such representation can
be used to recognize other complex events. This is why,
despite much progress made in the past decade in this context,
the computational approaches involved in complex event
recognition are reliable only under certain domain-specific
constraints.

Moreover, with the popularity of handheld video recording
devices, the issue becomes more serious since a huge
number of videos are currently being captured by
non-professional users under unconstrained conditions with
limited quality control (for example, in contrast to videos
from broadcast news, documentary, or controlled
surveillance). This amplifies the semantic gap challenge. However,
it also opens a great opportunity since the proliferation of
such user-generated videos has greatly contributed to the
rapidly growing demand for new capabilities in recognition
of high-level events in videos.

In  this  paper,  we  will  first  discuss  the  current  popular 
methods  for  high-level  video  event  recognition  from  mul¬ 
timedia  data.  We  will  review  multimodal  features,  models, 
and  evaluation  strategies  that  have  been  widely  studied  by 
many  groups  in  the  recent  literature.  Compared  to  a  few  exist¬ 
ing  survey  papers  in  this  area  (as  summarized  in  the  follow¬ 
ing  subsection),  this  paper  has  a  special  focus  on  high-level 
events. We will provide in-depth descriptions of techniques
that have shown promise in recent benchmark evaluation
activities such as the multimedia event detection (MED)
task [99] of NIST TRECVID (see footnote 1). Additionally, we
will discuss several important related issues such as the design
of evaluation benchmarks for high-level event recognition. Finally,
we  will  identify  promising  directions  for  future  research  and 
developments.  To  stimulate  further  research,  at  the  end  of 
each  important  section  we  provide  comments  summarizing 
the  issues  with  the  discussed  approaches  and  insights  that 
may  be  useful  for  the  development  of  future  high-level  event 
recognition  systems. 


1 TREC video retrieval evaluation (TRECVID) [128] is an open forum
for promoting and evaluating new research in video retrieval. It
features a benchmark activity sponsored annually, since 2001, by the US
National Institute of Standards and Technology (NIST). See
http://trecvid.nist.gov for more details.


1.1 Related reviews

There have been several related papers that review the
research of video content recognition. Most of them focused
on human action/activity analysis, e.g., [1] by Aggarwal and
Ryoo, [111] by Poppe, and [139] by Turaga et al., where low-level
features, representations, classification models, and datasets
were comprehensively surveyed. While most human activity
research was done on constrained videos with limited content
(e.g., clean background and no camera motion), recent works
have also shifted focus to the analysis of realistic videos such
as user-uploaded videos on the Internet, or broadcast and
documentary videos.

In  [130],  Snoek  and  Worring  surveyed  approaches  to  mul¬ 
timodal  video  indexing,  focusing  on  methods  for  detect¬ 
ing  various  semantic  concepts  consisting  of  mainly  objects 
and  scenes.  They  also  discussed  video  retrieval  techniques 
exploring  concept-based  indexing,  where  the  main  applica¬ 
tion  data  domains  were  broadcast  news  and  documentary 
videos.  Brezeale  and  Cook  [17]  surveyed  text,  video,  and 
audio  features  for  classifying  videos  into  a  predefined  set  of 
genres,  e.g.,  “sports”  or  “comedy”.  Morsillo  et  al.  [94]  pre¬ 
sented  a  brief  review  that  focused  on  efficient  and  scalable 
methods  for  annotating  Web  videos  at  various  levels  includ¬ 
ing  objects,  scenes,  actions,  and  high-level  events.  Lavee  et 
al.  [67]  reviewed  event  modeling  methods,  mostly  in  the  con¬ 
text  of  simple  human  activity  analysis.  A  review  more  related 
to  this  paper  is  the  one  by  Ballan  et  al.  [8],  which  discussed 
features  and  models  for  detecting  both  simple  actions  and 
complex  events  in  videos. 

Different  from  the  existing  surveys  mentioned  above,  this 
paper  concentrates  on  the  recognition  of  high-level  com¬ 
plex  events  from  multimedia  data,  such  as  those  mentioned 
in  Fig.  1.  Many  techniques  for  recognizing  objects,  scenes, 
and  human  activities  will  be  discussed  to  the  extent  that  is 
needed  for  understanding  high-level  event  recognition.  How¬ 
ever,  providing  full  coverage  of  those  topics  is  beyond  the 
scope  of  this  work. 

1.2  Outline 

We  first  organize  the  research  of  high-level  event  recogni¬ 
tion  into  several  key  dimensions  (shown  in  Fig.  2),  based  on 
which  the  paper  will  be  structured.  While  the  design  of  a  real 
system  may  depend  on  application  requirements,  key  com¬ 
ponents  like  those  identified  (feature  extraction  and  recog¬ 
nition  model)  are  essential.  These  core  technologies  will  be 
discussed  in  Sects.  2  and  3.  In  Sect.  4,  we  discuss  advanced 
issues  beyond  simple  video-level  classification,  such  as  tem¬ 
poral  localization  of  events,  textual  recounting  of  detection 
results,  and  techniques  for  improving  recognition  speed  and 
dealing with large-scale data. To help stimulate new research
activities, we present reviews of popular benchmarks and
explore insights from several top-performing systems in recent
evaluation forums in Sect. 5. Finally, we discuss promising
directions for future research in Sect. 6 and present
conclusions in Sect. 7.

Fig. 2 Overview of various aspects of video event recognition
research. The presentation of the paper is structured based on this
organization. Numbers correspond to the sections covering the topics

2  Feature  representations 

Features play a critical role in video event analysis. Good
features are expected to be robust against variations so that
videos of the same event class under different conditions can
still be correctly recognized. There are two main sources of
information that can be exploited. The first is the visual
channel, which on one hand depicts appearance information
related to objects and scene settings, and on the other hand
captures motion information pertaining to the movement of the
constituent objects and the motion of the camera. The second is
the acoustic channel, which may contain music, environmental
sounds and/or conversations. Both channels convey useful
information, and many visual and acoustic features have
been devised. We discuss static frame-based visual features
in Sect. 2.1, spatio-temporal visual features in Sect. 2.2,
acoustic features in Sect. 2.3, audio-visual joint
representations in Sect. 2.4, and finally, the bag-of-features framework,
which converts audio/visual features into fixed dimensional
vectors, in Sect. 2.5.


2.1  Frame-based  appearance  features 

Appearance-based  features  are  computed  from  a  single 
frame.  They  do  not  consider  the  temporal  dimension  of  video 
sequences  but  are  widely  used  in  video  analysis  since  they  are 
relatively  easy  to  compute  and  have  been  shown  to  work  well 
in  practice.  There  has  been  a  very  rich  knowledge  base  and 
extensive  publicly  available  resources  (public  tools)  devoted 
to  static  visual  features.  We  divide  existing  works  into  local 
and  global  features,  as  will  be  discussed  in  the  following. 

2.1.1  Local  features 

A video frame can be represented efficiently using a set of
discriminative local features extracted from it. The extraction
of local features consists of two steps: detection and
description. Detection refers to the process of locating stable patches
that have some desirable properties which can be employed
to create a “signature” of an image. In practice, uniform and
dense sampling of image patches is often used instead; it
incurs some obvious storage overhead, but avoids the rather
computationally expensive (though less storage-intensive)
patch detection [101].

Fig. 3 Example results of local detectors: Harris-Laplace (left) and
DoG (right). The images are obtained from [30]

Among popular local patch (a.k.a. interest point) detection
algorithms, the most widely used one is Lowe’s
Difference-of-Gaussian (DoG) [77], which detects blob regions where
the center differs from the surrounding area. Other popular
detectors include Harris-Laplace [72], Hessian [88],
maximally stable extremal regions (MSERs) [85], etc. Harris and
Hessian focus on detection of corner points. MSER is also
for blob detection, but relies on a different scheme. Unlike
DoG, which detects local maxima in multi-scale Gaussian
filtered images, MSER finds regions whose segmentation is
stable over a large range of thresholds. Figure 3 shows
example results of two detectors. Interested readers are referred
to [90] for a comprehensive review of several local patch
detectors. Although dense sampling eliminates the need for
detectors, recent experiments in [34] confirm that methods
using both strategies (sparse detection and dense sampling)
offer the best performance in visual recognition tasks. In this
direction, Tuytelaars proposed a hybrid selection strategy [140],
demonstrating how the advantages of both sampling schemes can
be efficiently combined to improve recognition performance.
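
The blob-detection idea behind DoG can be illustrated with a small sketch. It is not Lowe's full detector: it uses a single pair of scales instead of a scale-space pyramid, and the function names, the `sigma` and `ratio` defaults, and the border handling are all assumptions made for illustration. An image is blurred at two scales, the results are subtracted, and the location with the strongest response is reported.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _clamp(value, size):
    """Replicate border pixels when the kernel runs off the image."""
    return min(max(value, 0), size - 1)


def blur(img, sigma):
    """Separable Gaussian blur of a 2D list of intensities."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(img), len(img[0])
    rows = [[sum(k[j + r] * img[y][_clamp(x + j, w)] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + r] * rows[_clamp(y + j, h)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def dog_response(img, sigma=1.0, ratio=1.6):
    """Difference of two Gaussian-blurred copies of the image."""
    fine, coarse = blur(img, sigma), blur(img, sigma * ratio)
    return [[fine[y][x] - coarse[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]


def strongest_blob(img, sigma=1.0, ratio=1.6):
    """(row, col) of the largest absolute DoG response."""
    d = dog_response(img, sigma, ratio)
    return max((abs(d[y][x]), (y, x))
               for y in range(len(d)) for x in range(len(d[0])))[1]
```

On a dark image containing one small bright square, `strongest_blob` returns the center of the square, while a uniform image yields a response of essentially zero everywhere. A practical detector would search for extrema across many scales and discard weak or edge-like responses.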

Once local patches are identified, the next stage is to
describe them in a meaningful manner so that the resulting
descriptors are (partially) invariant to rotation, scale,
viewpoint, and illumination changes. Since the descriptors are
computed from small patches rather than the whole frame,
they are also somewhat robust to partial occlusion and
background clutter.

Many  descriptors  have  been  designed  over  the  years. 
The  best-known  is  scale-invariant  feature  transform  (SIFT) 
[77],  which  partitions  a  patch  into  equal-sized  grids,  each 
described  by  a  histogram  of  gradient  orientations.  A  key  idea 
of  SIFT  is  that  a  patch  is  represented  relative  to  its  dominant 
orientation,  which  provides  a  nice  property  of  rotation  invari¬ 
ance.  SIFT,  coupled  with  several  local  detectors  introduced 
above,  has  been  among  the  most  popular  choices  in  recent 
video  event  recognition  systems  [10,58,96,98]. 
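
The rotation-invariance idea can be sketched in a few lines. This is a deliberately simplified illustration and not the SIFT descriptor itself: it builds one orientation histogram for the whole patch rather than one per grid cell, omits Gaussian weighting and interpolation, and the function names and the choice of 8 bins are assumptions. The histogram is circularly shifted so that the dominant orientation always lands in the first bin.

```python
import math


def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations in a patch."""
    hist = [0.0] * n_bins
    width = 2 * math.pi / n_bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            # Central differences approximate the image gradient.
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            # Bins are centred on multiples of `width`, so axis-aligned
            # gradients fall in the middle of a bin, not on an edge.
            hist[int(round(angle / width)) % n_bins] += mag
    return hist


def rotation_normalised_descriptor(patch, n_bins=8):
    """Histogram expressed relative to the dominant orientation."""
    hist = orientation_histogram(patch, n_bins)
    total = sum(hist)
    if total == 0.0:
        return hist
    dominant = max(range(n_bins), key=lambda b: hist[b])
    shifted = hist[dominant:] + hist[:dominant]
    return [v / total for v in shifted]
```

A horizontal intensity ramp and a vertical one have different raw histograms (their gradients point in different directions), but their normalised descriptors coincide, which is the property the dominant-orientation step is meant to provide.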

SIFT has been extended in various ways. PCA-SIFT was
proposed by Ke et al. [60], who applied principal component
analysis (PCA) to reduce the dimensions of SIFT. The authors
stated that PCA-SIFT is not only compact but also more robust,
since PCA may help reduce noise in the original SIFT descriptors.
However, such performance gains of PCA-SIFT were not
found in the comparative study in [89]. An improved version
of SIFT, called gradient location and orientation histogram
(GLOH), was proposed in [89]; it uses a log-polar location
grid instead of the original rectangular grid in SIFT. The work in
[119] studied color descriptors that incorporate color
information into the intensity-based SIFT for improved object and
scene recognition, and reported a performance gain of 8%
on the PASCAL VOC 2007 dataset (see footnote 2). Further, to
improve computational efficiency, Bay et al. [12] developed SURF
as a fast alternative descriptor using 2D Haar wavelet responses.

Several other descriptors have also been popular in this
context. Histogram of oriented gradients (HOG) was
proposed by Dalal and Triggs [27] to capture edge distributions
in images or video frames. Local binary pattern (LBP) [103]
is another texture feature, which uses binary numbers to label
each pixel of a frame by comparing its value to those of its
neighborhood pixels.
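
As a concrete illustration of the LBP labeling step, the following sketch computes the basic 8-neighbor code for each interior pixel and accumulates the codes into a histogram. The neighbor ordering, the use of `>=` for the comparison, and the function names are assumptions made for this example; published variants differ in these details.

```python
def lbp_code(img, r, c):
    """8-bit LBP code of the pixel at (r, c) in a 2D list of intensities."""
    center = img[r][c]
    # Neighbours visited clockwise from the top-left pixel; the order is
    # arbitrary but must stay fixed so codes are comparable.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

Because each code depends only on whether neighbors are brighter or darker than the center, adding a constant to every pixel leaves the histogram unchanged, which is one reason such features tolerate global illumination shifts.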

2.1.2 Global features

In earlier works, global representations were employed,
which encode a whole image based on the overall
distribution of color, texture, or edge information. Popular ones
include color histogram, color moments [166], and Gabor
texture [83]. Oliva and Torralba [104] proposed a very low
dimensional scene representation which implicitly encodes
perceptual naturalness, openness, roughness, expansion, and
ruggedness using spectral and coarsely localized
information. Since this represents the dominant spatial structure of
a scene, it is referred to as the GIST descriptor. Most of
these global features adopt grid-based representations which
take the spatial distribution of the scene into account (e.g., “sky”
always appears above “road”). Features are computed within
each grid cell separately and then concatenated as the final
representation. This simple strategy has been shown to be effective
for various image/video classification tasks.
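
The grid-based strategy can be sketched as follows, using a grayscale intensity histogram as a stand-in for whichever global feature is chosen. The function name, the 2 × 2 grid, the 4 bins, and the per-cell normalisation are illustrative assumptions, not settings taken from the works cited above.

```python
def grid_histogram(img, grid=2, bins=4, max_value=256):
    """Concatenate per-cell intensity histograms of a 2D list of pixels."""
    h, w = len(img), len(img[0])
    feature = []
    for gr in range(grid):
        for gc in range(grid):
            # Pixel range covered by this grid cell.
            r0, r1 = gr * h // grid, (gr + 1) * h // grid
            c0, c1 = gc * w // grid, (gc + 1) * w // grid
            hist = [0] * bins
            for r in range(r0, r1):
                for c in range(c0, c1):
                    hist[img[r][c] * bins // max_value] += 1
            total = sum(hist) or 1
            # Normalise so cells of different sizes are comparable.
            feature.extend(v / total for v in hist)
    return feature
```

For a frame whose top half is bright and bottom half is dark, the first cells of the resulting vector concentrate their mass in the brightest bin and the last cells in the darkest bin, so the vector records where in the frame each intensity range occurs. A single histogram over the whole frame would lose that layout.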

2 The PASCAL visual object classes (VOC) challenge is an annual
benchmark competition on image-based object recognition, supported
by the EU-funded PASCAL2 Network of Excellence on Pattern Analysis,
Statistical Modeling and Computational Learning.

Summary Single-frame based feature representations, such as
SIFT, GIST, and HOG, are the most straightforward
to compute and have low complexity. These features have
been shown to be extremely discriminative for videos that do
not depict rapid inter-frame changes. For videos with rapid
content changes, one needs to carefully sample frames from
which these features can be extracted if not all the frames are
used. Since an optimal keyframe selection strategy is yet to be
developed, researchers in practice sample frames uniformly.
A low sampling rate could lead to loss of vital information,
while high sampling rates result in redundancies.
Furthermore, these features do not include temporal information,
and hence they are ineffective in representing motion, a very
important source of information in videos. This motivates us
to move on to the next section that discusses spatio-temporal
(motion) features.
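
Uniform frame sampling, as mentioned in the summary above, amounts to picking every k-th frame. The helper below is a minimal sketch; the function name and its parameters are hypothetical and chosen only for illustration.

```python
def uniform_sample_indices(n_frames, fps, samples_per_second):
    """Indices of frames picked at a fixed rate from a video."""
    # Keep every `step`-th frame; never skip below one frame per step.
    step = max(1, round(fps / samples_per_second))
    return list(range(0, n_frames, step))
```

The trade-off described in the text is visible directly in `step`: a large step keeps few frames and may miss short-lived content, while a step of 1 keeps every frame, including many near-duplicates.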

2.2  Spatio-temporal  visual  features 

Different  from  frame-based  features,  spatio-temporal  fea¬ 
tures  take  the  time  dimension  of  videos  into  account,  which 
is  intuitively  appealing  since  temporal  motion  information  is 
critical  for  understanding  high-level  events. 

2.2.1 Spatio-temporal local features

Many spatio-temporal video features have been proposed.
Apart from several efforts in designing global
spatio-temporal representations, a more popular direction is to
extend the frame-based local features to work in 3D (x, y, t),
namely spatio-temporal descriptors. In [65], Laptev extended
the Harris corner patch detector [72] to locate
spatial-temporal interest points (STIPs), which are space-time
volumes in which pixel values have significant variations in
both space and time. Figure 4 gives two example results of
STIP detection. As will be discussed in later sections, STIP
has been frequently used in recent video event recognition
systems.

Several alternatives to STIP have been proposed.
Dollár et al. [29] proposed to use Gabor filters for 3D
key-point detection. The detector, called Cuboid, finds local
maxima of a response function that contains a 2D Gaussian
smoothing kernel and 1D temporal Gabor filters. Rapantzikos
et al. [112] used saliency to locate spatio-temporal points,
where the saliency is computed by a global minimization
process which leverages spatial proximity, scale, and
feature similarity. To compute the feature similarity, they also
utilized color, in addition to intensity and motion that are
commonly adopted in other detectors. Moreover, Willems
et al. [158] used the determinant of a Hessian matrix as the
saliency measure, which can be efficiently computed using
box-filter operations on integral videos. Wang et al. [151]
conducted a comparative study on spatio-temporal local
features and found that dense sampling works better than the
sparse detectors STIP, Cuboid, and Hessian, particularly on videos
captured under realistic settings (in contrast to those taken in
constrained environments with clean backgrounds). Dense
sampling, however, requires a much larger number of
features to achieve good performance.

Fig. 4 Results of STIP detection using a synthetic sequence (left) and a
realistic video (right, the detected points are shown on an image frame).
The images are reprinted from [65] (©2005 Springer-Verlag)

Like the 2D local features, we also need descriptors to
encode the 3D spatio-temporal points (volumes). Most
existing 3D descriptors are motivated by those designed for
the 2D features. Dollár et al. [29] tested simple flattening of
intensity values in a cuboid around an interest point, as well
as global and local histograms of gradients and optical flow.
SIFT [77] was extended to 3D by Scovanner et al. [122], and
SURF [12] was adapted to 3D by Knopp et al. [62]. Laptev
et al. [66] used grid-based (by dividing a 3D volume into
multiple grids) HOG and histogram of optical flow (HOF)
to describe STIPs [65], and found the concatenation of HOG
and HOF descriptors very effective. Biologically, the
combination of the two descriptors also makes good sense since
HOG encodes appearance information while HOF captures
motion cues. Kläser et al. [61] also extended HOG to 3D and
proposed to utilize integral videos for fast descriptor
computation. The self-similarities descriptor was adapted by Junejo
et al. [123] for cross-view action recognition. Recently,
Taylor et al. [135] proposed to use convolutional neural
networks to implicitly learn spatio-temporal descriptors,
and obtained human action recognition performance
comparable to the STIP detector paired with HOG-HOF
descriptors. Le et al. [69] combined independent subspace
analysis (ISA) with ideas from convolutional neural
networks to learn invariant spatio-temporal features. The ISA
features achieved better results than the standard STIP detector
and HOG-HOF descriptors on several action
recognition benchmarks [69].

2.2.2  Trajectory  descriptors 

Spatio-temporal information can also be captured by
tracking the frame-based local features. Trajectory descriptors are
theoretically superior to descriptors such as HOG-HOF, 3D
SURF, Dollár Cuboids, etc. This is because they require the
detection of a discriminative point or region over a sustained
period of time, unlike the latter, which compute various
pixel-based statistics within a predefined spatio-temporal
neighborhood (typically 50 × 50 × 20 pixels). However,
computing trajectory descriptors incurs substantial
computational overhead. The first of its kind was proposed
by Wang et al. [149], who used the well-known
Kanade-Lucas-Tomasi (KLT) tracker [79] to extract
DoG-SIFT key-point trajectories, and computed a feature by
modeling the motion between every trajectory pair. Sun et
al. [132] also applied KLT to track DoG-SIFT key-points.
Different from [149], they computed three levels of
trajectory context, including point-level context, which is an
averaged SIFT descriptor, intra-trajectory context, which models
trajectory transitions over time, and inter-trajectory context,
which encodes proximities between trajectories. The
velocity histories of key-point trajectories were modeled by Messing
et al. [87], who observed that velocity information is useful
for detecting daily living actions in high-resolution videos.
Uemura et al. [141] combined feature tracking and frame
segmentation to estimate dominant planes in the scene, which
were used for motion compensation. Yuan et al. [171]
clustered key-point trajectories based on spatial proximities
and motion patterns. Like [141], this method extracts
relative features from clusters of trajectories on the background
that describe the motion differently from those emanating
from the foreground. As a result, the effect of camera motion
can be alleviated using this approach. In addition, Raptis and
Soatto [113] proposed tracklets, which differ from
long-term trajectories by capturing the local causal structure of
action elements. A more recent work by Wu et al. [159] used
Lagrangian particle trajectories and decomposed the
trajectories into camera-induced and object-induced components,
which makes their method robust to camera motion. Wang et
al. [150] performed tracking on dense patches. They showed
that dense trajectories significantly outperform KLT
tracking of sparse key-points on several human action
recognition benchmarks. In addition, a trajectory descriptor called
motion boundary histogram (MBH) was also introduced in
[150], which is based on the derivatives of optical flow. The
derivatives are able to suppress constant motion, making
MBH robust to camera movement. It has been shown to
be very effective for action recognition in realistic videos
[150]. Most recently, the work of [54] proposed to use
local and global reference points to model the motion of
dense trajectories, leading to a comprehensive representation
that integrates trajectory appearance, location, and motion.
The resulting representation is expected to be robust to
camera motion, and also to be able to capture the relationships of
moving objects (or object-background relationships). Very
competitive results were observed on several human action
recognition benchmarks.
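
The claim that flow derivatives suppress constant motion can be checked with a small sketch. It is a simplified illustration rather than the MBH implementation of [150]: it assumes a precomputed optical flow component stored as a 2D list, uses a single histogram instead of a grid of cells along a trajectory, and the function names and 8-bin quantisation are assumptions.

```python
import math


def spatial_gradient(field):
    """Central-difference (d/dx, d/dy) at interior pixels of a 2D field."""
    grads = []
    for r in range(1, len(field) - 1):
        for c in range(1, len(field[0]) - 1):
            dx = (field[r][c + 1] - field[r][c - 1]) / 2.0
            dy = (field[r + 1][c] - field[r - 1][c]) / 2.0
            grads.append((dx, dy))
    return grads


def mbh_histogram(flow_component, n_bins=8):
    """Orientation histogram of the spatial derivatives of one flow component."""
    hist = [0.0] * n_bins
    width = 2 * math.pi / n_bins
    for dx, dy in spatial_gradient(flow_component):
        mag = math.hypot(dx, dy)
        if mag > 0.0:
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(round(angle / width)) % n_bins] += mag
    return hist
```

Adding the same offset to every flow vector, as a translating camera roughly does, leaves the histogram unchanged, and a purely constant flow field produces an all-zero histogram. Only the boundaries between regions that move differently contribute. In practice the horizontal and vertical flow components are typically processed separately and the two histograms concatenated.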

Summary Spatio-temporal visual features capture
meaningful statistics from videos, especially those related to local
changes or saliency in both the spatial and temporal
dimensions. Most of the motion-based features are restricted to
either optical flow or its derivatives. The role of
semantically more meaningful motion features (e.g., kinematic
features [2,4]) is yet to be tested in the context of this problem.
Furthermore, most of these feature descriptors capture
statistics based on either motion alone, or motion and appearance
independently. Treating the motion and appearance
modalities jointly could reveal important information that is
otherwise lost in the process. Trajectories computed from local
features have been shown to achieve performance gains at the
cost of additional computing overhead.

2.3  Acoustic  features 

Acoustic information is valuable for video analysis, particularly
when the videos are captured under realistic and unconstrained
environments. Mel-frequency cepstral coefficients (MFCC) are
among the most popular acoustic features for sound classification
[7,33,163]. MFCC represent the short-term power spectrum of an
audio signal, based on a linear cosine transform of a log power
spectrum on a nonlinear mel scale of frequency. Xu et al. [163]
used MFCC together with another popular feature, the
zero-crossing rate (ZCR), for
audio  classification.  Predictions  of  audio  categories  such  as 
“whistling”  and  “audience  sound”  are  used  for  detecting 
high-level  sports  events  like  “foul”,  “goal”,  etc.  Baillie  and 
Jose  [7]  used  a  similar  framework,  but  with  MFCC  features 
alone,  for  audio-based  event  recognition. 
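
The MFCC pipeline described above (power spectrum, mel-scale filter bank, logarithm, cosine transform) and the zero-crossing rate can be sketched for a single audio frame as follows. This is a minimal standard-library illustration, not the implementation used in [7,33,163]: the filter count, the number of coefficients, the Hamming window, the mel formula constants, and all function names are assumptions chosen for the example.

```python
import cmath
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum of a Hamming-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin index
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                  for row in bank]
    # DCT-II of the log mel energies (the "linear cosine transform").
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energy))
            for c in range(num_coeffs)]

sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs), round(zero_crossing_rate(frame), 3))
```

In a full system such a function would be applied to every short-term frame of the soundtrack, yielding one descriptor per frame; the naive DFT is used here only to keep the example dependency-free.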

Eronen  et  al.  [33]  evaluated  many  audio  features.  Using  a 
dataset  of  realistic  audio  contexts  (e.g.,  “road”,  “supermar¬ 
ket”  and  “bathroom”),  they  found  that  the  best  performance 
was  achieved  by  MFCC.  In  a  different  vein,  Patterson  et  al. 
[107]  proposed  the  auditory  image  model  (AIM)  to  simulate 
the  spectral  analysis,  neural  encoding,  and  temporal  inte¬ 
gration  performed  by  the  human  auditory  system.  In  other 
words,  AIM  is  a  time-domain  model  of  auditory  processing 
intended  to  simulate  the  auditory  images  humans  hear  when 
presented  with  complex  sounds  like  music,  speech,  etc.  There 
are  three  main  stages  involved  in  the  construction  of  an  audi¬ 
tory  image.  First,  an  auditory  filter  bank  is  used  to  simulate 
the  basilar  membrane  motion  (BMM)  produced  by  a  sound  in 
the  cochlea  (auditory  portion  of  the  inner  ear).  Next,  a  bank 
of  hair  cell  simulators  converts  the  BMM  into  a  simulation  of 
the  neural  activity  pattern  (NAP)  produced  at  the  level  of  the 
auditory  nerve.  Finally,  a  form  of  strobed  temporal  integra¬ 
tion  (STI)  is  applied  to  each  channel  of  the  NAP  to  stabilize 
any  repeating  pattern  and  convert  it  into  a  simulation  of  our 
auditory  image  of  the  sound.  Thus,  sequences  of  auditory 
images  can  be  used  to  illustrate  the  dynamic  response  of  the 
auditory  image  to  everyday  sounds.  A  recent  work  in  this 
direction  shows  that  features  computed  on  auditory  images 
perform  better  than  the  more  conventional  MFCC  features 
for  audio  analysis  [80]. 

Speech is another acoustic cue that can be extracted
from  video  soundtracks.  An  early  work  by  Chang  et  al. 
[22]  reported  that  speech  understanding  is  even  more  useful 
than  image  analysis  for  sports  video  event  recognition.  They 
used  filter  banks  as  features  and  simple  template  matching 
to  detect  a  few  pre-defined  keywords  such  as  “touchdown”. 
Minami  et  al.  [91]  utilized  music  and  speech  detection  to 
assist  video  analysis,  where  the  detection  is  based  on  sound 
spectrograms.  Automatic  speech  recognition  (ASR)  has  also 
been used for years in the annual TRECVID video retrieval
evaluations  [128].  A  general  conclusion  is  that  ASR  is  use¬ 
ful  for  text-based  video  search  over  the  speech  transcripts 
but  not  for  semantic  visual  concept  classification.  In  [96], 
ASR  was  found  to  be  helpful  for  a  few  events  (e.g.,  narrative 
explanations in procedural videos), but not for general
videos. 

Summary  Acoustic  features  have  been  found  useful  in 
high-level video event recognition. Although many new features
have been proposed in the literature, the most widely used one
is still MFCC. Developing new audio
representations  that  are  more  suitable  for  video  event  recog¬ 
nition  is  an  interesting  direction. 

2.4  Audio-visual  joint  representations 

Audio  and  visual  features  are  mostly  treated  independently 
for  multimedia  analysis.  However,  in  practice,  they  are  not 
independent except in some special cases where a video’s audio
channel is dubbed with entirely different audio content, e.g.,
a motorbike stunt video dubbed with a music track. In the usual
case, statistical information such as co-occurrence, correlation,
covariance, and causality can be
exploited  across  both  audio  and  visual  channels  to  perform 
efficient  multimodal  analysis. 

Beal et al. [13] proposed using graphical models to
combine  audio  and  visual  variables  for  object  tracking.  This 
method  was  designed  for  videos  captured  in  a  controlled 
environment  and  therefore  may  not  be  applicable  to  uncon¬ 
strained  videos.  More  recently,  Jiang  et  al.  [51]  proposed  a 
joint  audio-visual  feature,  called  audio-visual  atom  (AVA). 
An  AVA  is  an  image  region  trajectory  associated  with  both 
regional  visual  features  and  audio  features.  The  audio  fea¬ 
ture  (MFCC  of  audio  frames)  and  visual  feature  (color  and 
texture of short-term region tracks) are first quantized to
discrete codewords separately. Jointly occurring audio-visual
codeword  pairs  are  then  discovered  using  a  multiple  instance 
learning  framework.  Compared  to  simple  late  fusion  of  clas¬ 
sifiers  using  separate  modalities,  better  results  were  observed 
using  a  bag  of  AVA  representation  on  an  unconstrained  video 
dataset  [76].  This  approach  was  further  extended  in  [52], 
where  a  representation  called  audio-visual  grouplet  (AVG) 
was  proposed.  AVGs  are  sets  of  audio  and  visual  codewords. 
The  codewords  are  grouped  together  as  an  AVG  if  strong 
temporal  correlations  exist  among  them.  The  temporal  corre¬ 
lations  were  determined  using  Granger’s  temporal  causality 
[43].  AVGs  were  shown  to  be  better  than  simple  late  fusion 
of  audio-visual  features. 

The  methods  introduced  in  [51,52]  require  either  frame 
segmentation  or  foreground/background  separation,  which  is 
computationally expensive. Ye et al. [168] proposed a simple
and efficient method called bi-modal audio-visual codewords.
The bi-modal words were generated using normalized cut on a
bipartite graph of visual and audio words, which captures the
co-occurrence relations between audio and visual words within
the same time window. Each bi-modal word is a group of visual
and/or audio words that frequently co-occur.
Promising performance was reported in high-level
event  recognition  tasks. 

Summary  Audio  and  visual  features  extracted  using  the 
methods  discussed  provide  a  promising  representation  to 
capture  the  multimodal  characteristics  of  the  video  content. 
However,  these  features  are  still  quite  limited.  For  example, 
MFCC  and  region-level  visual  features  may  not  be  the  right 
representation  for  discovering  cross-modal  correlations.  In 
addition,  the  quality  of  the  features  may  not  be  adequate  due 
to  noise,  clutter,  and  motion.  For  example,  camera  motion 
can  be  a  useful  cue  to  discriminate  between  life  events  (usu¬ 
ally  depicting  random  camera  jitter,  zoom,  pan,  and  tilt)  and 
procedural  events  (usually  static  camera  with  occasional  pan 
and/or  tilt).  None  of  the  current  feature  extraction  meth¬ 
ods  addresses  this  issue  as  the  spatio-temporal  visual  fea¬ 
ture  extraction  algorithms  are  not  capable  of  distinguishing 
between  the  movement  of  the  objects  in  the  scene  and  the 
camera  motion.  Another  disadvantage  with  feature-based 
techniques  is  that  they  are  often  ungainly  in  terms  of  both 
dimensionality  and  cardinality,  which  leads  to  storage  issues 
as the number of videos is enormous. It is therefore desirable
to  seek  an  additional  intermediate  representation  for  further 
analysis. 

2.5  Bag  of  features 

The  local  features  (e.g.,  SIFT  [77]  and  STIP  [65])  discussed 
above  vary  in  set  size,  i.e.,  the  number  of  features  extracted 
differs  across  videos  (depending  on  complexity  of  contents, 
video  duration,  etc.).  This  poses  difficulties  for  measuring 
video/frame  similarities  since  most  measurements  require 
fixed-dimensional  inputs.  One  solution  is  to  directly  match 
local  features  between  two  videos  and  determine  video  sim¬ 
ilarity  based  on  the  similarities  of  the  matched  feature  pairs. 
The  pairwise  matching  process  is  nevertheless  computation¬ 
ally  expensive,  even  with  the  help  of  indexing  structures. 
This issue can be addressed using a framework called
bag-of-features or bag-of-words (BoW) [127]. Motivated by the
well-known  bag-of-words  representation  of  textual  docu¬ 
ments,  BoW  treats  images  or  video  frames  as  “documents” 
and  uses  a  similar  word  occurrence  histogram  to  represent 
them,  where  the  “visual  vocabulary”  is  generated  by  clus¬ 
tering  a  large  set  of  local  features  and  treating  each  cluster 
center  as  a  “visual  word”. 
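
The BoW construction described above can be sketched as follows: a vocabulary is obtained by clustering pooled descriptors, and each video is then mapped to a fixed-length word occurrence histogram regardless of how many descriptors it contains. This is a minimal illustration under stated assumptions: it uses plain Lloyd's k-means with a naive initialization, toy 2-D descriptors in place of real ones such as SIFT, and hypothetical names `kmeans` and `bow_histogram`.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def bow_histogram(descriptors, vocabulary):
    """L1-normalized visual-word occurrence histogram (hard assignment)."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        word = min(range(len(vocabulary)),
                   key=lambda j: sq_dist(d, vocabulary[j]))
        hist[word] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy 2-D "descriptors" pooled from training videos.
training = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
vocab = kmeans(training, k=2)

video_a = [(0.1, 0.1), (0.0, 0.3), (5.1, 5.0)]                          # 3 descriptors
video_b = [(5.0, 5.2), (4.8, 5.1), (5.3, 4.7), (0.2, 0.1), (5.0, 5.0)]  # 5 descriptors
hist_a = bow_histogram(video_a, vocab)
hist_b = bow_histogram(video_b, vocab)
print(hist_a, hist_b)
```

Both videos end up with histograms of the same length (the vocabulary size), so they can be compared with any measure that expects fixed-dimensional inputs.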

BoW  has  been  popular  in  image/video  classification  for 
years.  The  performance  of  BoW  is  sensitive  to  many  imple¬ 
mentation  choices,  which  have  been  extensively  studied  in 
several  works,  mostly  in  the  context  of  image  classifica¬ 
tion  with  the  frame-based  local  features  like  SIFT.  Zhang 
et al. [174] evaluated various local features and reported
competitive object recognition performance by combining
multiple local patch detectors and descriptors. Jiang et al.
[55] conducted a series of analyses on several choices of
BoW in video concept detection, including term weighting
schemes (e.g., term frequency and inverse document
frequency) and vocabulary size (i.e., the number of clusters).
A key finding is that the choice of term weighting matters
considerably, and a soft-weighting scheme was proposed to
alleviate the effect of quantization error by softly assigning
a descriptor to multiple visual words. The usefulness of
such a soft-weighting scheme was also confirmed in detecting
a large number of concepts in TRECVID [21]. A similar
idea of soft assignment was also presented by Philbin
et  al.  [109].  van  Gemert  et  al.  [41]  proposed  an  interest¬ 
ing  approach  called  kernel  codebooks  to  tackle  the  same 
issue of quantization loss. For vocabulary size, a general
observation is that a few hundred to several thousand
visual words might be sufficient for most visual
classification  tasks.  In  addition,  Liu  and  Shah  proposed 
to  apply  maximization  of  mutual  information  (MMI)  for 
visual  word  generation  [75].  Compared  to  typical  meth¬ 
ods  like  k-means,  MMI  is  able  to  produce  a  higher  level 
of  word  clusters,  which  are  semantically  more  meaning¬ 
ful  and  also  more  discriminative  for  visual  recognition. 
A  feature  selection  method  based  on  the  page-rank  idea 
was  proposed  in  [74]  to  remove  local  patches  that  may 
hurt  the  performance  of  action  recognition  in  unconstrained 
videos.  Feature  selection  techniques  were  also  adopted 
to  choose  discriminative  visual  words  for  video  concept 
detection  [56]. 
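
The idea of soft assignment mentioned above can be sketched as follows. The example spreads each descriptor over its nearest visual words using Gaussian weights; this is one possible weighting chosen for illustration and is not claimed to be the exact scheme of [55], [109], or [41]. The parameters `top_n` and `sigma` and the function name are assumptions.

```python
import math

def soft_bow_histogram(descriptors, vocabulary, top_n=2, sigma=1.0):
    """Spread each descriptor over its top_n nearest words (Gaussian weights)."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [math.sqrt(sum((x - y) ** 2 for x, y in zip(d, w)))
                 for w in vocabulary]
        nearest = sorted(range(len(vocabulary)), key=lambda j: dists[j])[:top_n]
        weights = [math.exp(-dists[j] ** 2 / (2 * sigma ** 2)) for j in nearest]
        norm = sum(weights)
        if norm == 0.0:
            # All weights underflowed: fall back to hard assignment.
            weights, norm = [1.0] + [0.0] * (len(nearest) - 1), 1.0
        for j, w in zip(nearest, weights):
            hist[j] += w / norm   # each descriptor contributes a total mass of 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

vocabulary = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
# A descriptor halfway between words 0 and 1: hard assignment would have to
# pick one of them, whereas soft assignment splits the vote evenly.
print(soft_bow_histogram([(0.5, 0.0)], vocabulary))   # [0.5, 0.5, 0.0]
```

Descriptors that fall near a cluster boundary therefore no longer flip between histogram bins under small perturbations, which is the quantization effect that soft weighting is meant to reduce.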

Spatial  locations  of  the  patches  are  ignored  in  standard 
BoW representation, which is not ideal, since the patch locations
convey useful information. Lazebnik et al. [68] adopted a scheme
similar to that of some of the global representations, by
partitioning  a  frame  into  rectangular  grids  at  various  lev¬ 
els,  and  computing  a  BoW  histogram  for  each  grid.  The  his¬ 
tograms  from  grids  at  each  level  are  then  concatenated  as  a 
feature vector, and a pyramid match kernel is applied to measure
the similarity between frames, each with multiple features
from different spatial partitioning levels. This simple
method has proven effective in many applications and
is now widely adopted. Researchers also found that direct
concatenation of BoW histograms from grids of all levels,
combined with support vector machine (SVM) learning with standard
kernels  offers  similar  performance  to  the  pyramid  match  ker¬ 
nel.  However,  it  is  worth  noting  that,  different  from  BoW  of 
the  2D  frames,  the  spatial  pyramid  architecture  has  rarely 
been  used  in  spatio-temporal  feature-based  BoW  representa¬ 
tions,  again  indicating  the  difficulty  in  handling  the  temporal 
dimension  of  videos. 
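
The grid partitioning and concatenation step of the spatial pyramid can be sketched as below. This minimal example covers only the direct-concatenation variant mentioned above and omits the level weights of the pyramid match kernel; the function name, the argument layout, and the toy patch locations are assumptions.

```python
def spatial_pyramid_bow(points, words, width, height, vocab_size, levels=2):
    """Concatenate per-cell BoW histograms over a 1x1, 2x2, ... grid pyramid.

    points: (x, y) patch locations; words: visual-word index of each patch.
    """
    feature = []
    for level in range(levels):
        cells = 2 ** level                      # cells per side at this level
        hists = [[0.0] * vocab_size for _ in range(cells * cells)]
        for (x, y), w in zip(points, words):
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][w] += 1.0
        for h in hists:
            feature.extend(h)
    return feature

points = [(10, 10), (90, 10), (90, 90)]   # patch centers in a 100 x 100 frame
words = [0, 1, 1]                         # their assigned visual words
vec = spatial_pyramid_bow(points, words, width=100, height=100, vocab_size=2)
print(len(vec))   # 2 * (1 + 4) = 10
print(vec)
```

The first two entries are the global histogram, and the remaining eight record in which quadrant each visual word occurred, which is the location information that a standard BoW histogram discards.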

The  BoW  framework  can  also  be  extended  to  represent  a 
sound as a bag of audio words (a.k.a. bag-of-frames in the
audio community), where the acoustic features are computed
locally  from  short-term  auditory  frames  (a  time  window  of 
tens  of  milliseconds),  resulting  in  a  set  of  auditory  descrip¬ 
tors  from  a  sound.  This  representation  has  been  studied  in 
several works for audio classification, where the implementation
choices differ slightly from those of the visual feature
representations. Mandel et al. [82] used Gaussian mixture
models (GMM) to describe a song as a bag of frames for
classification. Aucouturier et al. [5] and Lee et al. [70] also adopted a
similar  representation.  Aucouturier  et  al.  conducted  an  inter¬ 
esting  set  of  experiments  and  found  that  bag-of-frames  per¬ 
forms  well  for  urban  soundscapes  but  not  for  polyphonic 
music.  In  place  of  GMM,  Lu  et  al.  [78]  adopted  spectral 
clustering  to  generate  auditory  keywords.  Promising  audio 
retrieval  performance  was  attained  using  their  proposed  rep¬ 
resentation  on  sports,  comedy,  award  ceremony,  and  movie 
videos.  Cotton  et  al.  [26]  proposed  to  extract  sparse  transient 
features  corresponding  to  soundtrack  events,  instead  of  the 
uniformly  and  densely  sampled  audio  frames.  They  reported 
that,  with  fewer  descriptors,  transient  features  produce  com¬ 
parable  performance  to  the  dense  MFCCs  for  audio-based 
video  event  recognition,  and  the  fusion  of  both  can  lead  to 
further improvements. The bag-of-audio-words representation
has also been adopted in several video event recognition
systems  with  promising  performance  (e.g.,  [10,58],  among 
others). 

Figure  5  shows  a  general  framework  of  BoW  representa¬ 
tion,  using  different  audio-visual  features.  A  separate  vocab¬ 
ulary  is  constructed  for  each  feature  type,  by  clustering  the 
corresponding  feature  descriptors.  Finally,  a  BoW  histogram 
is  generated  for  each  feature  type.  Histograms  can  then  be 
normalized  to  create  multiple  representations  of  the  input 
video.  In  the  simplest  case,  all  the  histogram  representa¬ 
tions  can  be  concatenated  to  create  a  final  representation 
before classification. This approach is usually termed early
fusion. An alternative approach is late fusion, where the
histogram  representations  are  independently  fed  into  classi¬ 
fiers  and  decisions  from  the  classifiers  are  combined.  These 
will  be  discussed  in  detail  later  in  Sect.  3.4. 
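
The difference between the two fusion strategies can be sketched in a few lines. The example is only an illustration: the histograms and classifier scores are made-up numbers, and averaging is just one simple way of combining decisions in late fusion.

```python
def l1_normalize(hist):
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)

def early_fusion(histograms):
    """Concatenate normalized per-feature histograms into one vector."""
    fused = []
    for h in histograms:
        fused.extend(l1_normalize(h))
    return fused

def late_fusion(scores, weights=None):
    """Combine per-feature classifier scores by a (weighted) average."""
    weights = weights or [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical BoW histograms from three feature types of one video.
sift_hist, stip_hist, mfcc_hist = [2, 2], [1, 3], [4, 0]
print(early_fusion([sift_hist, stip_hist, mfcc_hist]))
# [0.5, 0.5, 0.25, 0.75, 1.0, 0.0]

# Hypothetical scores from three classifiers, one per feature type.
print(late_fusion([0.5, 0.25, 0.75]))   # 0.5
```

In early fusion a single classifier sees the concatenated vector, whereas in late fusion one classifier per feature type is trained and only the resulting scores are combined.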

Summary  As  it  is  evident  that  there  is  no  single  feature 
that  is  sufficient  for  high-level  event  recognition,  current 
research  strongly  suggests  the  joint  use  of  multiple  fea¬ 
tures,  such  as  static  frame-based  features,  spatio-temporal 
features,  and  acoustic  features.  However,  whether  BoW  is  the 
best  model  to  obtain  meaningful  representations  of  a  video 
remains  an  important  open  issue.  Although  this  technique 
performs  surprisingly  well  [24,58,96,98],  the  major  draw¬ 
back  of  the  systems  conforming  to  this  paradigm  is  their 
incapability  to  obtain  deep  semantic  understanding  of  the 
videos, which is a prevalent issue in high-level event analysis.
This is because they provide a compact representation of
a complex event depicted in a video based on the underlying
features without having any understanding of the hierarchical
components, such as interactions or actions that constitute the
complex event. Needless to say, the sense of spatio-temporal
localization of these components is lost in this coarse
representation. Besides, these methods also suffer from the usual
disadvantages of quantization used in converting raw features
to discrete codewords, as pointed out in [16,109].

Fig. 5 Bag-of-features representations obtained from different
feature modalities for high-level event detection


3  Recognition  methods 

Given  the  feature  representations,  event  recognition  can  be 
achieved  by  various  classifiers.  This  is  a  typical  machine 
learning  process,  where  a  set  of  annotated  videos  are  given 
for  model  training.  The  models  are  then  applied  to  new 
videos  for  event  recognition.  We  divide  the  discussion  of 
recognition  methods  into  four  subsections.  Section  3.1  intro¬ 
duces  kernel  classifiers,  where  we  mainly  discuss  SVM,  the 
most  popular  classifier  in  current  event  recognition  systems. 
Section  3.2  discusses  graphical  methods,  which  are  able  to 
explicitly  model  temporal  relationships  between  low-level 
events.  Section  3.3  describes  knowledge-based  techniques, 
which  can  incorporate  domain  knowledge  into  event  recog¬ 
nition.  In  Sect.  3.4,  we  discuss  several  fusion  techniques  to 
explore  the  power  of  combining  multimodal  features. 

3.1  Kernel  classifiers 

Kernel-based  classifiers  have  been  popular  in  a  wide  range  of 
applications  for  many  years  [45].  With  kernels,  linear  classi¬ 
fiers  that  have  been  comprehensively  studied  can  be  applied 
in  kernel  space  for  nonlinear  classification,  which  often  leads 
to  significantly  improved  performance.  Among  many  choices 
of  kernel-based  classifiers  (e.g.,  kernel  Fisher  discriminants), 
SVM is the most widely used algorithm due to its reliable
performance across many different tasks, including high-level
video event recognition. In the following sections, we discuss
several issues related to applying SVM for video event
recognition. 

3.1.1  Direct  classification 

Event recognition is often formulated in a one-versus-all
manner based on low-level representations, where a two-class
SVM is trained to classify each event. For two-class
SVM,  the  decision  function  for  a  feature  vector  x  of  a  test 
video  has  the  following  form: 

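$$f(\mathbf{x}) = \sum_i \alpha_i y_i \mathcal{K}(\mathbf{x}_i, \mathbf{x}) - b, \qquad (1)$$

where $\mathcal{K}(\mathbf{x}_i, \mathbf{x})$ is the output of a kernel function for the
feature of the $i$th training video $\mathbf{x}_i$ and the test sample $\mathbf{x}$;
$y_i$ is the event class label of $\mathbf{x}_i$; $\alpha_i$ is the learned weight
of the training sample $\mathbf{x}_i$; and $b$ is a learned threshold parameter.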
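
The decision function of Eq. (1) can be evaluated directly once the weights, labels, and threshold are known. The sketch below assumes a hypothetical, hand-specified model with two support vectors and uses a Gaussian RBF kernel purely as a placeholder; in practice these quantities come from SVM training.

```python
import math

def svm_decision(x, support_vectors, labels, alphas, b, kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) - b."""
    return sum(a * y * kernel(xi, x)
               for xi, y, a in zip(support_vectors, labels, alphas)) - b

def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(u, v)))

# Hypothetical "trained" model: two support vectors with hand-picked weights.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]          # -1: background video, +1: event of interest
alphas = [1.0, 1.0]
b = 0.0

score = svm_decision((1.9, 2.1), support_vectors, labels, alphas, b, rbf_kernel)
print("event detected" if score > 0 else "event not detected")
```

In the one-versus-all setting, one such function is trained per event class, and a test video is scored by each of them.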

Choosing an appropriate kernel function $\mathcal{K}(\mathbf{x}, \mathbf{y})$ is critical
to the classification performance. For BoW representations
of feature descriptors like SIFT or STIP, it has been reported
that the $\chi^2$ Gaussian kernel is suitable [55,150,174], defined as

$$\mathcal{K}(\mathbf{x}, \mathbf{y}) = e^{-\rho\, d_{\chi^2}(\mathbf{x}, \mathbf{y})}, \qquad (2)$$

where $d_{\chi^2}(\mathbf{x}, \mathbf{y}) = \sum_j \frac{(x_j - y_j)^2}{x_j + y_j}$ is the $\chi^2$ distance between
samples $\mathbf{x}$ and $\mathbf{y}$. The $\chi^2$ Gaussian kernel was employed in all the
recently developed top-performing high-level event recognition
systems [10,48,58,96,98].

Fig. 6 Responses of lower level concept detectors in an arbitrary
video depicting a complex event “changing a tire”. This figure is
best viewed in color. See text for more explanations.

The performance of SVM classification is sensitive to a
few parameters, among which the most critical one is $\rho$ in
the kernel function of Eq. (2). The selection of a suitable
parameter depends on the data distribution, which varies from
task to task. A common way is to use cross-validation, which
evaluates a wide range of parameter values and picks the best
one. However, this strategy is time-consuming. Recently,
researchers have empirically found that setting $\rho$ to $1/\bar{d}$
often leads to near-optimal performance [174], where $\bar{d}$ is the
mean of pairwise distances among all training samples. This
simple strategy is currently widely adopted.
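
The $\chi^2$ Gaussian kernel and the mean-distance heuristic for its scale parameter can be sketched as follows. This is a minimal illustration on toy histograms; the function names are assumptions, and bins that are empty in both histograms are skipped to avoid division by zero, which is a common convention rather than something specified in the text.

```python
import math

def chi2_distance(x, y):
    """sum_j (x_j - y_j)^2 / (x_j + y_j), skipping bins empty in both."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def mean_pairwise_distance(samples):
    pairs = [(i, j) for i in range(len(samples))
             for j in range(i + 1, len(samples))]
    return sum(chi2_distance(samples[i], samples[j]) for i, j in pairs) / len(pairs)

def chi2_kernel_matrix(samples):
    # Heuristic scale: rho = 1 / (mean pairwise distance), no cross-validation.
    rho = 1.0 / mean_pairwise_distance(samples)
    return [[math.exp(-rho * chi2_distance(a, b)) for b in samples]
            for a in samples]

# Toy L1-normalized BoW histograms of three training videos.
train = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.2, 0.8]]
K = chi2_kernel_matrix(train)
for row in K:
    print([round(v, 3) for v in row])
```

The resulting matrix can be passed to any SVM solver that accepts precomputed kernels. The heuristic assumes that the training samples are not all identical, since the mean distance would then be zero.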

While  accumulating  all  the  feature  descriptors  from  a 
video  into  a  single  feature  vector  seems  a  reasonable  choice 
for  event  recognition,  it  neglects  the  temporal  information 
within  the  video  sequence.  This  issue  can  be  addressed  by 
using  graphical  models  as  will  be  described  in  Sect.  3.2. 
Another  feasible  solution  is  to  use  the  earth  mover’s  distance 
(EMD)  [115]  to  measure  video  similarity.  EMD  computes 
the optimal flows between two sets of frames/clips, producing
the optimal match between the two sets. By incorporating the
EMD into an SVM classifier, temporal event matching can be
achieved to a certain extent. This method
was  originally  proposed  by  Xu  et  al.  [162]  for  event  recog¬ 
nition  in  broadcast  news  videos  with  promising  results. 
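
A restricted case of the EMD can be sketched without a linear programming solver. When both videos are represented by the same number of clips with uniform weights, the optimal flow reduces to a one-to-one matching, so a brute-force search over permutations suffices for a handful of clips. The sketch below covers only this special case and is not the formulation of [162]; the general EMD requires a transportation solver, and the exponential kernel form and its `scale` parameter are assumptions that mirror Eq. (2).

```python
import itertools
import math

def emd_uniform(set_a, set_b, ground_distance):
    """EMD between two equal-size sets with uniform weights.

    In this special case the optimal flow is a one-to-one matching, so a
    brute-force search over permutations is enough for small sets.
    """
    n = len(set_a)
    assert n == len(set_b), "this sketch only handles equal-size sets"
    best = min(sum(ground_distance(set_a[i], set_b[p[i]]) for i in range(n))
               for p in itertools.permutations(range(n)))
    return best / n

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def emd_kernel(set_a, set_b, scale=1.0):
    """One possible way to plug the EMD into a kernel classifier."""
    return math.exp(-emd_uniform(set_a, set_b, euclidean) / scale)

# Hypothetical per-clip features; video_2 shows the same clips as video_1
# in a different temporal order, video_3 shows different content.
video_1 = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
video_2 = [(5.0, 5.0), (0.0, 0.0), (1.0, 0.0)]
video_3 = [(9.0, 9.0), (8.0, 9.0), (9.0, 8.0)]

print(emd_uniform(video_1, video_2, euclidean))   # 0.0: same clips, reordered
print(emd_kernel(video_1, video_3) < emd_kernel(video_1, video_2))   # True
```

Because the matching is free to pair clips regardless of their position in the sequence, the measure tolerates temporal misalignment between two videos of the same event.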

3.1.2  Hierarchical  classification  using  concept-based 
recognition 

Approaches under the direct classification category work
satisfactorily to some extent. However, as discussed earlier,
they are incapable of providing an understanding of the semantic
structure present in a complex event. Consider the event
“changing a vehicle tire”, which typically consists of
semantically lower level classes such as “person opening car
trunk”, “person using wrench”, “person jacking car”, etc. A
bag-of-words representation that collapses information into a
long feature vector, followed by direct classification, is
evidently unable to explain the aforementioned semantic structure.

This has motivated researchers to explore how an
alternative representation could be efficiently utilized for
semantic analysis of complex events. Events can mostly be
characterized by several moving objects (person, vehicle,
etc.), and generally occur under particular scene settings
(kitchen, beach, mountain, etc.) with certain sounds
(metallic clamor, wooden thud, cheering, etc.) and cues from
overlaid or scene texts (street names, placards, etc.).
Detection of these intermediate concepts has proven
useful for high-level event recognition.

Figure  6  gives  an  example  of  concept  detection  results  in 
a  video  of  event  “changing  a  tire”,  where  the  top  row  shows 
sampled  frames.  The  blue  horizontal  bar  gives  a  sense  of 
the  temporal  sampling  window,  on  which  pre-trained  con¬ 
cept  detectors  are  applied.  The  smaller  green  horizontal  bars 
correspond  to  the  actual  granularity  of  the  lower  level  con¬ 
cept  classes  (obtained  from  manual  annotation).  The  bot¬ 
tom  3  rows  show  the  detector  responses  from  different  fea¬ 
ture  modalities  (each  vertical  bar  indicates  a  concept).  After 
combining  the  responses  of  concept  detectors  from  differ¬ 
ent  modalities,  we  observe  that  the  concept  “person  opens 
trunk”  is  detected  with  maximum  confidence  in  the  shown 
window. This is very close to the ground truth. A similar trend is
observed  for  other  concepts  like  “person  fitting  bolts”,  “per¬ 
son  squatting”  and  “person  turning  wrench”,  which  are  all 
very  relevant  to  the  event  “changing  a  tire”. 

A  few  efforts  have  been  devoted  to  the  definition  of  a  suit¬ 
able  set  of  concepts.  One  representative  work  is  the  LSCOM 
ontology  [95],  which  defined  1,000+  concepts  by  carefully 
considering  their  utility  for  video  retrieval,  feasibility  of 
automatic  detection,  and  observability  in  actual  datasets. 
In  addition,  several  works  directly  adopted  the  WordNet 
ontology (e.g., ImageNet [28]). A simple and popular way
to  utilize  these  concepts  in  event  recognition  is  to  adopt  a 
two-layer  SVM  classification  structure  [10,24,96,98],  where 
each model in the first layer detects a semantic concept,
and  a  second-level  model  is  used  to  recognize  event  classes 
using  a  representation  based  on  the  first-layer  outputs  as  fea¬ 
ture.  All  these  works  [10,96,98]  have  reported  notable  but 
small  performance  gains  from  this  hierarchical  classification 
approach,  after  fusing  it  with  direct  event  classification  using 
low-level features like SIFT and STIP.
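
The two-layer structure can be sketched as follows. Everything in the example is hypothetical: the first-layer detectors are hand-written linear scorers standing in for trained concept SVMs, the responses are max-pooled over temporal windows (one possible aggregation, assumed here), and the second-layer weights are chosen by hand.

```python
def concept_scores(window_features, concept_detectors):
    """First layer: run every concept detector on every temporal window and
    keep each concept's maximum response over the video (max pooling)."""
    return [max(detector(w) for w in window_features)
            for detector in concept_detectors]

def linear_score(weights, bias, vector):
    return sum(w * v for w, v in zip(weights, vector)) + bias

# Hypothetical first-layer detectors; in practice each would be a trained SVM.
concept_detectors = [
    lambda w: linear_score([1.0, 0.0], -0.5, w),   # e.g. "person using wrench"
    lambda w: linear_score([0.0, 1.0], -0.5, w),   # e.g. "person jacking car"
]
# Hypothetical second-layer event model over the concept-score vector.
event_weights, event_bias = [1.0, 1.0], -0.6

video = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1)]       # three temporal windows
scores = concept_scores(video, concept_detectors)
event_score = linear_score(event_weights, event_bias, scores)
print(scores, event_score > 0)
```

The intermediate vector `scores` is what makes the approach interpretable: it records which concepts were found in the video before the event-level decision is made.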

Once  such  an  intermediate  representation  is  established, 
there  are  a  variety  of  techniques  that  can  be  applied  for  com¬ 
plex  event  detection.  The  idea  of  two-layer  classification  can 
be extended to build a more sophisticated event model using
co-occurrence  and  covariance  of  concepts.  In  order  to  fur¬ 
ther  exploit  temporal  dependencies  between  the  constituting 
concepts,  graphical  models  can  be  employed.  A  detailed  dis¬ 
cussion  on  the  use  of  graphical  models  will  be  provided  in 
Sect.  3.2. 

The  concept-based  hierarchical  classification  framework 
has  several  advantages  over  direct  classification  approaches 
using  low-level  features.  As  described  earlier,  this  methodol¬ 
ogy  decomposes  a  complex  event  into  several  semantically 
meaningful entities, where many of the lower level concepts
may be easier to detect since training samples of
these  concepts  are  relatively  less  diverse  or  noisy.  Also,  this 
framework  can  be  extended  to  discover  ad  hoc  events  with 
few  exemplars,  if  the  lower  level  concepts  can  be  reliably 
detected.  Furthermore,  hierarchical  classification  also  paves 
a  way  for  event  recounting  as  detection  of  these  concepts 
can provide detailed information. However, despite all these
advantages,  there  are  also  a  few  drawbacks.  First,  the  con¬ 
cept  detectors  alone  require  intense  training  and  they  are  not 
guaranteed  to  perform  consistently  across  different  datasets. 
Second,  obtaining  high  quality  training  data  is  a  challenging 
and  time-consuming  task. 

Summary Hierarchical event classification
has  not  been  explored  heavily.  Considering  that  the  cur¬ 
rent  approaches  can  only  improve  direct  classification  by 
a  small  gain,  we  believe  that  there  is  still  room  for 
substantial  improvement  from  this  recognition  paradigm. 
Currently,  concepts  such  as  human  faces,  pedestrians,  and 
simple  scenes/objects  can  be  detected  fairly  reliably.  We  con¬ 
jecture  that  as  more  concepts  are  reliably  detected,  and  as 
more  advanced  hierarchical  models  are  designed,  much  more 
significant  performance  gain  will  be  achieved.  In  addition, 
current  selections  of  mid-level  concepts  are  still  ad  hoc.  A 
systematic way of discovering relevant concepts and constructing
suitable ontological structures among concepts is
lacking. 

3.2  Graphical  models 

There has been a plethora of literature over the last few
decades that advocates the use of graphical models for
the analysis of sequential data. Most approaches under this
category combine insights from probability and graph theory
to find structure in sequential data. These approaches
can be broadly categorized into two sub-categories: directed
graphical models and undirected graphical models. Popular
methods of the former category include hidden Markov models
(HMMs), Bayesian networks (BNs), and their variants.
Markov random fields (MRFs), conditional random fields
(CRFs), etc. belong to the latter.

Fig. 7 An illustration of a typical discrete HMM. The model
parameters can be obtained from model training

The simplest case of a directed graphical model is an
HMM, which adopts a single-layer state-space formulation
wherein the current hidden state depends only upon its
immediately previous state, and each observation depends only
upon the current hidden state. Observations can either
be represented as discrete symbols (discrete HMM) or a
continuous distribution (continuous HMM). A discrete HMM is
illustrated in Fig. 7, where circular elements denote the hidden
states, blue arrows denote the transitions between state
pairs, gray rectangular elements are the observed symbols,
and the black arrows show the observation likelihood of a
symbol given a state. Note that the arrows in the graph shown
in Fig. 7 are directed: they run between hidden states and from
hidden states to observed symbols. In the context of complex
event recognition, a directed graphical model is characterized
by directed acyclic graphs, which can be used to
represent state-space relationships between constituent lower
level events or sub-events.

The  application  of  directed  graphical  models  in  activity  or 
event  recognition  can  be  traced  back  to  the  work  of  Yamato  et 
al.  [164],  where  the  authors  proposed  HMMs  for  recognizing 
tennis  actions  such  as  service,  forehand  volley,  smash,  etc. 
In  their  method,  they  extracted  human  figures  by  a  standard 
background  subtraction  technique  and  binarized  the  resulting 
image. Mesh features on 8 × 8 binary patches were used to
represent  each  image  frame.  These  features  were  then  trans¬ 
formed  to  a  symbol  sequence  where  each  symbol  encodes  a 
keyframe  in  the  input  image  sequence.  For  each  action  class, 
a  separate  discrete  HMM  was  trained  using  the  transformed 
symbol  sequences. 
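
The per-class decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [164]: the forward algorithm scores a symbol sequence under each class-specific discrete HMM, and the class with the highest likelihood wins. The model names and all probabilities below are made-up assumptions, and training (e.g., by Baum-Welch) is omitted.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM via the forward algorithm."""
    n_states = len(start)
    # alpha[i] is proportional to P(o_1..o_t, state_t = i); it is rescaled at
    # every step so that long sequences do not underflow.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik

def classify(obs, models):
    """Pick the class whose HMM assigns the symbol sequence the highest likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Two hypothetical action models over the symbol alphabet {0, 1}.
# Each entry is (start probabilities, transition matrix, emission matrix).
MODELS = {
    "mostly_zero": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
    "mostly_one": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.1, 0.9], [0.2, 0.8]]),
}
```

In this sketch, classify([0, 0, 0, 1, 0], MODELS) selects the model whose emissions favor symbol 0.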

Over the past 2 decades, several other works [71,97,
131,160]  have  used  HMMs  and  their  variants  in  human 
action  recognition.  Starner  and  Pentland  [131]  were  among 
the  early  adopters  of  HMMs  in  their  research  on  sign  lan¬ 
guage  recognition.  Xie  et  al.  [160]  demonstrated  how  HMMs 
and  hierarchical  composition  of  multiple  levels  of  HMMs 
could  be  efficiently  used  to  classify  play  and  non-play  seg¬ 
ments  of  soccer  videos.  Motivated  by  the  success  of  HMMs, 
Li  et  al.  [71]  introduced  an  interesting  methodology  to  model 
an  action  where  hidden  states  in  HMMs  were  replaced  by 
visualizable salient poses (which form an action) estimated
using Gaussian mixture models. Since states in HMMs are
not directly observable, mapping them to poses is an interesting
idea. In the work by Natarajan and Nevatia [97], an action
is  modeled  by  a  top-down  approach,  where  the  topmost  level 
represents  composite  actions  containing  a  single  Markov 
chain,  and  the  middle  level  represents  primitive  actions  mod¬ 
eled  using  a  variable  transition  HMM,  followed  by  simple 
HMMs  that  form  the  bottommost  layer  representing  human 
pose  transitions.  Recently,  Inoue  et  al.  [48]  reported  promis¬ 
ing  results  in  TRECVID  MED  task  [99]  using  HMMs  to 
characterize  audio  which  is  often  observed  to  be  a  useful  cue 
in  multimedia  analysis. 

There  are  other  types  of  directed  graphical  models  that 
have been studied in event recognition. One disadvantage
of the HMM formulation is its inability to model
causality. This problem is alleviated by a different kind of
directed  graphical  model  called  Bayesian  networks  (BN). 
BNs  are  capable  of  efficiently  modeling  causality  using 
conditional  independence  between  states.  This  methodology 
facilitates  semantically  and  computationally  efficient  factor¬ 
ization  of  observation  state  space.  In  this  vein,  Intille  and 
Bobick  [49]  introduced  an  agent-based  probabilistic  frame¬ 
work  that  exploits  the  temporal  structure  of  complex  activi¬ 
ties  typically  depicted  in  American  football  plays.  They  used 
noisy trajectory data of football players collected from a
static  overhead  camera  to  obtain  temporal  (e.g.,  before  or 
after)  and  logical  (e.g.,  pass  or  no  pass)  relationships,  which 
are  then  used  to  model  interactions  between  multiple  agents. 
Finally,  the  BNs  are  applied  to  identify  10  types  of  strategic 
plays. 

BNs  cannot  implicitly  encapsulate  temporal  information 
between  different  nodes  or  states  in  the  finite  state  machine 
model.  Dynamic  Bayesian  networks  (DBNs)  can  achieve 
this  by  exploiting  the  factorization  principles  available  in 
Bayesian  methods  while  preserving  the  temporal  structure. 
Research  on  event  recognition  using  DBNs  is  relatively  new 
as  compared  to  other  approaches  since  it  requires  a  certain 
amount  of  domain  knowledge.  Huang  et  al.  [47]  presented 
a  framework  for  semantic  analysis  of  soccer  videos  using 
DBNs,  where  they  successfully  recognized  events  such  as 
corner  kicks,  goals,  penalty  kicks,  etc. 


BNs,  HMMs  and  their  variants  fall  under  the  philosophy 
of  generative  classification,  which  models  the  input,  reducing 
variance  of  parameter  estimation  at  the  expense  of  possibly 
introducing  model  bias.  Because  of  the  generative  nature  of 
the  model,  a  distribution  is  learned  over  the  possible  obser¬ 
vations  given  the  state.  However,  during  inference  or  clas¬ 
sification,  it  is  the  observation  that  is  provided.  Hence,  it  is 
more  intuitive  to  condition  on  the  observation,  rather  than  the 
state. 

This  has  motivated  researchers  to  investigate  alternative 
strategies  for  modeling  complex  events  using  undirected 
graphical  models,  some  of  which  are  naturally  suited  for 
discriminative  modeling  tasks.  To  this  end,  Vail  et  al.  [144] 
made  a  strong  contribution  by  introducing  Conditional  Ran¬ 
dom  Fields  for  activity  recognition.  In  their  work,  the  authors 
show  that  CRFs  can  be  discriminatively  trained  based  on 
conditioning  on  the  entire  observation  sequence  rather  than 
on individually observed samples. A CRF can be perceived as a
linear  chain  HMM  without  any  directional  edges  between  the 
hidden  states  and  observations.  In  case  of  HMMs,  the  model 
parameters  (transition,  emission  probabilities)  are  learned  by 
maximizing  the  joint  probability  distribution,  whereas,  the 
parameters  of  a  CRF  (potentials)  are  learned  by  maximizing 
the  conditional  probability  distribution.  As  a  consequence, 
while learning the parameters of a CRF, modeling the distribution
of the observations is not taken into consideration.
The authors of [144] produced convincing evidence in
favor of CRFs against HMMs in the context of activity recognition.
Inspired by the success of [144], Wang and Suter [153]
introduced  a  variant  of  CRFs  which  can  efficiently  model 
the  interactions  between  temporal  order  of  human  silhouette 
observations  for  complex  event  recognition.  Wang  and  Mori 
[154]  extended  the  idea  of  general  CRFs  to  a  max-margin 
hidden  CRF  for  classification  of  human  actions,  where  they 
model  a  human  action  as  a  global  root  template  and  a  con¬ 
stellation  of  several  “parts”.  More  recently,  in  [25],  Conolly 
proposed modeling and recognizing complex events using
CRFs, taking observations obtained from multiple stereo
systems in the surveillance domain.
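
The difference between the joint and the conditional objective can be illustrated with a small linear-chain CRF. The sketch below is a generic textbook formulation rather than the model of [144]: given per-frame label scores (the unary potentials, which are the only place where the observations enter) and label-transition scores (the pairwise potentials), it computes log p(y | x) by subtracting the log-partition function, obtained with a forward recursion, from the score of the label sequence. The potential values are made-up assumptions.

```python
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(labels, unary, pairwise):
    """Unnormalized log-score: per-frame potentials plus label-transition potentials."""
    score = sum(unary[t][y] for t, y in enumerate(labels))
    score += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition(unary, pairwise):
    """log Z(x): log-sum of the scores of all label sequences, by forward recursion."""
    n_labels = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][j] + _logsumexp([alpha[i] + pairwise[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return _logsumexp(alpha)

def conditional_log_prob(labels, unary, pairwise):
    """log p(labels | observations); no distribution over observations is modeled."""
    return sequence_score(labels, unary, pairwise) - log_partition(unary, pairwise)

# Hypothetical potentials for 3 frames and 2 labels.
UNARY = [[1.0, 0.0], [0.2, 0.8], [0.0, 1.5]]
PAIRWISE = [[0.5, -0.5], [-0.5, 0.5]]
```

Training a CRF would adjust the potentials to maximize conditional_log_prob over labeled sequences, whereas HMM training maximizes the joint likelihood of labels and observations.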

Although  undirected  graphical  models  (CRFs)  are  far  less 
complex than their directed counterparts (DBNs), and enjoy
all  the  benefits  of  discriminative  classification  techniques, 
they  are  disadvantageous  in  situations  where  the  dependency 
between  an  event/action  and  its  predecessors  or  successors 
(e.g.,  cause  and  effect)  needs  to  be  modeled.  Although  some 
variants  of  CRFs  can  overcome  this  problem  by  incorporat¬ 
ing  additional  constraints  and  complex  parameter  learning 
techniques,  they  are  computationally  slow. 

Summary  Graphical  models  build  a  factorized  represen¬ 
tation  of  a  set  of  independencies  between  components  of 
complex  events.  Although  the  approaches  discussed  under 
this  section  are  mathematically  and  computationally  elegant, 
their success in complex event recognition is still inconclusive.
However, since these models provide an implicit level
of  abstraction  in  understanding  complex  events,  research  in 
this  direction  is  expected  to  gather  impetus  as  fundamental 
problems  such  as  feature  extraction  and  concept  detection 
become  more  mature.  With  this  said,  we  now  move  on  to  an 
alternative  approach  towards  building  this  abstraction  using 
knowledge-based  approaches,  in  particular,  techniques  fre¬ 
quently  employed  in  natural  language  processing. 

3.3  Knowledge-based  techniques 

Knowledge-based  techniques  normally  involve  the  construc¬ 
tion  of  event  models,  and  are  usually  used  in  special  domains 
such  as  airport  or  retail  surveillance,  parking  lot  security  and 
so  on,  where  high-level  semantic  knowledge  can  be  spec¬ 
ified  by  the  relevant  domain  experts.  Let  us  consider  the 
case  of  parking  lot  security.  The  usual  events  are  mostly 
deterministic  in  nature,  e.g.,  the  “parking  a  vehicle”  event 
would  typically  include  the  following  sub-events:  “vehicle 
enters  garage  through  entry”,  “vehicle  stopping  near  park¬ 
ing  space”,  “person  coming  out  of  car”,  and  “person  exiting 
garage”.  The  temporal  and  spatial  constraints  for  each  of 
these  lower  level  concepts  are  known  to  a  domain  expert. 
For example, a vehicle can stop in the driveway (spatial)
for only a small amount of time (temporal). Thus, any
violations  to  these  specified  constraints/knowledge  would  be 
considered  as  an  outlier  by  the  event  model. 

The  knowledge  of  the  context  of  an  event  is  an  extremely 
useful  cue  towards  understanding  the  event  itself.  Resear¬ 
chers  have  extensively  used  domain  knowledge  to  model 
events  using  different  approaches.  The  work  of  Francois 
et  al.  [39]  is  noteworthy  as  the  authors  attempted  to  envi¬ 
sion  an  open  standard  for  understanding  video  events.  One 
important part of the work is the modeling of events as composable,
whereby complex events are constructed from simpler
ones by operations such as sequencing, iteration, and
alternation.  In  addition,  the  authors  compiled  an  ontolog¬ 
ical  framework  for  knowledge  representation  called  video 
event  representation  language  (VERL)  based  on  foundations 
of  formal  language.  In  VERL,  they  described  complex  events 
by  composing  simpler  primitive  events,  where  sequencing  is 
the  most  common  composition  operation.  For  example,  an 
event  involving  a  person  getting  out  of  a  car  and  going  into 
a  building  is  described  by  the  following  sequence:  opening 
car  door,  getting  out  of  car,  closing  car  door  (optional),  walk¬ 
ing  to  building,  opening  building  door,  and  entering  building. 
The  authors  also  provided  an  accompanying  framework  for 
video  event  annotations  known  as  video  event  markup  lan¬ 
guage  (VEML). 
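
The sequencing operation with an optional step can be sketched as follows. This is only an illustration of the idea of composing a complex event from ordered primitive events; it does not use VERL syntax, and the template format and event names are assumptions made for this example. The matcher is greedy and assumes that the step names in a template are distinct.

```python
def matches_sequence(template, detected):
    """Return True if `detected` follows `template` in order.

    `template` is a list of (event name, is_optional) pairs; optional steps
    may be skipped, required steps may not, and no extra events are allowed.
    """
    i = 0
    for name, optional in template:
        if i < len(detected) and detected[i] == name:
            i += 1
        elif not optional:
            return False
    return i == len(detected)

# Hypothetical template for "person gets out of a car and goes into a building".
PERSON_ENTERS_BUILDING = [
    ("open car door", False),
    ("get out of car", False),
    ("close car door", True),  # optional step
    ("walk to building", False),
    ("open building door", False),
    ("enter building", False),
]
```

A detected sequence that omits "close car door" still matches, whereas a sequence with the first two steps swapped does not.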

Since a complex event can be formulated as a sequence
of lower level or primitive concepts, production rules from
formal language can be applied to generalize such complex
events. In analogy to the language framework, concepts play
the role of terminals and events the role of non-terminals.
The production rules can be augmented with probabilities of
occurrence of terminals or non-terminals using a stochastic
context-free grammar (SCFG), and with temporal relationships
(e.g., one terminal precedes another) expressed in Allen’s
temporal algebra [3]. In this section, we summarize
a few approaches that were proposed in the context
of event recognition.
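
The temporal relationships of Allen’s algebra [3] can be made concrete with a small function that names the relation between two intervals. The sketch below assumes that each interval is a (start, end) pair with start < end; the function name and the relation labels are choices made for this example.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Both intervals are (start, end) pairs with start < end. Exactly one of
    the 13 relations holds for any such pair.
    """
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # The intervals overlap partially and share no endpoint.
    return "overlaps" if a0 < b0 else "overlapped-by"
```

For instance, a "vehicle stops" interval that ends exactly when a "person gets out" interval begins would be labeled "meets".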

Ivanov and Bobick [50] proposed a probabilistic syntactic
approach to recognize complex events using the SCFG.
The idea was integrated into a real-time system to demonstrate
the approach in a video surveillance application. Soon
after, Moore and Essa [92] derived production rules using
SCFG  to  represent  multi-tasked  activities  as  a  sequence  of 
object  contexts,  image  features,  and  motion  appearances 
from exemplars. On a similar note, Ryoo and Aggarwal [117]
proposed the use of context-free grammar to represent an
event  as  temporal  processes  consisting  of  poses,  gestures,  and 
sub-events.  A  specialized  case  of  SCFG  is  attribute  grammar, 
where  the  conditions  on  each  production  rule  can  be  aug¬ 
mented  with  additional  semantics  from  prior  knowledge  of 
the  event  domain.  This  has  been  used  by  Joo  and  Chellappa 
[59]  who  attempted  to  recognize  atomic  events  in  parking  lot 
surveillance. 
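
To illustrate how an SCFG assigns a probability to a sequence of detected primitive events, the following sketch implements the standard inside (probabilistic CYK) algorithm for a grammar in Chomsky normal form. It is a generic textbook procedure and not necessarily the parser used in the works cited above; the toy parking grammar, its symbols, and its rule probabilities are made-up assumptions.

```python
from collections import defaultdict

def inside_probability(tokens, lexical, binary, start):
    """Probability that `start` derives `tokens` under a CNF stochastic grammar.

    `lexical` maps (head, terminal) to a probability and `binary` maps
    (head, left child, right child) to a probability.
    """
    n = len(tokens)
    # chart[(i, j)][A] = probability that non-terminal A derives tokens[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, tok in enumerate(tokens):
        for (head, word), p in lexical.items():
            if word == tok:
                chart[(i, i + 1)][head] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (head, left, right), p in binary.items():
                    pl = chart[(i, k)].get(left, 0.0)
                    pr = chart[(k, j)].get(right, 0.0)
                    if pl and pr:
                        chart[(i, j)][head] += p * pl * pr
    return chart[(0, n)].get(start, 0.0)

# Hypothetical grammar: a visit is an arrival followed by a departure, and a
# departure is either "person gets out, person exits" (0.7) or "drive away" (0.3).
BINARY = {
    ("Visit", "Arrive", "Depart"): 1.0,
    ("Arrive", "Enter", "Stop"): 1.0,
    ("Depart", "GetOut", "Exit"): 0.7,
}
LEXICAL = {
    ("Enter", "vehicle_enters"): 1.0,
    ("Stop", "vehicle_stops"): 1.0,
    ("GetOut", "person_gets_out"): 1.0,
    ("Exit", "person_exits"): 1.0,
    ("Depart", "drive_away"): 0.3,
}
```

A sequence that violates the ordering encoded in the rules receives probability zero, which is one way such a grammar can flag unexpected activity.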

The  influence  of  Case  Grammar  on  contemporary  linguis¬ 
tics  has  been  significant.  In  linguistics,  a  case  frame  describes 
important aspects of the semantic valency of verbs, adjectives,
and  nouns  [37].  For  instance,  the  case  frame  of  a  verb  “give” 
includes  an  Agent  (A),  an  Object  (O),  and  a  Beneficiary  (B), 
e.g.,  “NIST  (A)  gave  video  data  (O)  to  the  active  participants 
(B)”.  This  has  inspired  the  development  of  frame-based  rep¬ 
resentations  in  Artificial  Intelligence.  In  [44],  the  authors 
extended  the  idea  of  case  frame  to  represent  complex  event 
models.  Case  frame  was  extended  to  model  the  importance  of 
causal  and  temporal  relationships  between  low-level  events. 
Multi-agent  and  multi-threaded  events  are  represented  using 
a  hierarchical  case  frame  representation  of  events  in  terms 
of  low-level  events  and  case-lists.  Thus,  a  complex  event  is 
represented  using  a  temporal  tree  structure,  thereby  formulat¬ 
ing  event  detection  as  subtree  pattern  matching.  The  authors 
show  two  important  applications  of  the  proposed  event  rep¬ 
resentation  for  the  automated  annotation  of  standard  meeting 
video  sequences,  and  for  event  detection  in  extended  videos 
of  railroad  crossings. 

More  recently,  Si  et  al.  [125]  introduced  AND-OR  graphs 
to  learn  the  event  grammar  automatically  using  a  pre¬ 
specified  set  of  unary  (agent,  e.g.,  person  bending  torso)  and 
binary  (agent-environment,  e.g.,  person  near  the  trash  can) 
relations  detected  for  each  video  frame.  They  demonstrated 
how  the  learned  grammar  can  be  used  to  rectify  the  noisy 
detection  of  lower  level  concepts  in  office  surveillance. 

Fig. 8 An illustration of a typical place transition net

Efforts have also been made to represent and detect complex
events by combining first-order logic with probabilistic graphical
models, in the form of Markov Logic Networks (MLNs). MLNs can be interpreted
as  graphs  satisfying  Markovian  properties  whose  nodes  are 
atomic  formulas  from  first-order  logic  and  the  edges  are  the 
logical  connectives  used  to  construct  the  formulas.  Tran  and 
Davis  [137]  adopted  MLNs  to  model  complex  interactions 
between  people  in  parking  lot  surveillance  scenarios  by  inte¬ 
grating  common  sense  reasoning,  e.g.,  a  person  can  drive 
only  one  car  at  a  time. 

Knowledge  representation  can  be  achieved  using  networks 
or  graphical  structure.  Ghanem  et  al.  [42]  used  place  transi¬ 
tion  networks  or  Petri  Nets  (PTNs)  [20].  A  Petri  net  pro¬ 
vides  an  abstract  model  to  represent  the  flow  of  information 
using  a  directed  graphical  model  contrary  to  approaches  that 
use undirected graphical models (e.g., CRFs), leading to a
logical  inferencing  framework.  PTNs  can  be  explained  with 
Fig. 8, where the hollow circles denote places containing
solid circled tokens, rectangles depict transitions, and directed
arrows, called arcs, show the direction of the flow. A
change  in  the  distribution  of  tokens  inside  a  place  node  trig¬ 
gers  a  transition.  This  framework  was  used  for  event  model¬ 
ing  by  Cassel  et  al.  [20]  and  later  extended  in  [42]  for  parking 
lot  surveillance.  Here  objects  such  as  cars  and  humans  were 
treated  as  tokens,  single  object  or  two  object  conditions  such 
as  moving/stationary  and  spatially  near/far  were  considered 
the  places,  and  primitive  events  such  as  start,  stop,  accelerate, 
and decelerate were the transitions from one place node to
another. An example PTN model that exploits domain knowledge
for counting the number of cars in a parking area, as
given in [42], is to build a simple net linking primitive actions
“Car CO appears, Car CO enters parking area, Car CO stops,
Car CO leaves parking area” in a sequential order. During
the inference process, the positions of tokens in the Petri net
summarize the history of past events and predict what will
happen in the future, which facilitates incremental recognition
over past events.
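
The token-based bookkeeping described above can be sketched with a minimal place transition net. This is an illustration under simplifying assumptions and not the system of [42]: cars are tokens, car conditions are places, and the primitive events "appears", "enters", "stops", and "leaves" are transitions that move a token from one place to the next. The class, the place names, and the helper build_parking_net are hypothetical.

```python
class PetriNet:
    """A minimal place transition net with unit arc weights."""

    def __init__(self, marking):
        self.marking = dict(marking)  # place name -> number of tokens
        self.transitions = {}         # transition name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        """A transition is enabled when every input place holds enough tokens."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= inputs.count(p) for p in set(inputs))

    def fire(self, name):
        """Consume one token per input arc and produce one per output arc."""
        if not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1
        return True

def build_parking_net(n_cars):
    """Sequential net: outside -> visible -> moving_in_area -> parked -> outside."""
    net = PetriNet({"outside": n_cars, "visible": 0, "moving_in_area": 0, "parked": 0})
    net.add_transition("appears", ["outside"], ["visible"])
    net.add_transition("enters", ["visible"], ["moving_in_area"])
    net.add_transition("stops", ["moving_in_area"], ["parked"])
    net.add_transition("leaves", ["parked"], ["outside"])
    return net
```

After the detected events "appears", "enters", and "stops" are fired in order, the number of tokens in the "parked" place gives the current car count, and an out-of-order event such as "stops" before "enters" is rejected because its transition is not enabled.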

Summary  Knowledge-based  techniques,  although  easy 
to  understand,  make  several  assumptions  which  render  them 
ineffective  for  event  detection  in  unconstrained  videos.  As 
PTNs  rely  heavily  on  rule-based  abstractions  as  opposed  to 
probabilistic  learning-based  techniques,  methods  based  on 
such formalism are too rigid to be applied to unconstrained
cases where there are strong content diversities. Although
MLNs incorporate rule-based abstraction in a probabilistic
framework, there is no convincing evidence on whether
the inferencing mechanism can handle complex scenarios
where enumerating all possible rules is practically an infeasible
task. For the same reason, representations discussed in
[44,125] are not able to detect complex events in situations
where  (basically)  no  domain  knowledge  is  available. 

3.4 Fusion techniques

Fusing multiple features is generally helpful since different
features abstract videos from different aspects, and thus
may  complement  each  other.  To  recognize  complex  events 
in  unconstrained  videos,  acoustic  features  are  potentially 
important  since  the  original  soundtracks  of  such  videos 
are  mostly  preserved.  This  is  in  contrast  to  surveillance 
videos  with  no  audio  and  broadcast/movie  videos  mostly 
with  dubbed  soundtracks,  for  which  acoustic  features  are 
apparently  less  useful.  We  have  briefly  reviewed  a  few  audio¬ 
visual  representations  in  Sect.  2.4.  In  this  section,  we  discuss 
techniques  for  fusing  multiple  visual  and/or  audio  feature 
modalities. 

The  combination  of  multimodal  features  can  be  done  in 
various  ways.  The  most  popular  and  straightforward  strate¬ 
gies  are  early  fusion  and  late  fusion.  As  briefly  described  in 
Sect.  2.5,  early  fusion  concatenates  unimodal  features  into  a 
long  vector  for  event  learning  using  kernel  classifiers,  while 
late  fusion  feeds  each  unimodal  features  to  an  independent 
classifier  and  fusion  is  achieved  by  linearly  combining  the 
outputs  of  multiple  learners. 
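
The two strategies can be contrasted in a short sketch. The functions below are illustrative assumptions rather than code from any cited system: early_fusion builds the concatenated vector that a single classifier would consume, while late_fusion takes scores already produced by independent per-modality classifiers and combines them linearly, using equal weights (average fusion) unless weights are supplied.

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one long vector."""
    return [value for vec in feature_vectors for value in vec]

def late_fusion(scores, weights=None):
    """Linearly combine per-modality classifier scores.

    With no weights given, every modality contributes equally.
    """
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per modality is required")
    return sum(w * s for w, s in zip(weights, scores))
```

For example, early_fusion([visual, audio]) has length len(visual) + len(audio), which is why concatenating many modalities quickly inflates the dimensionality, whereas late_fusion operates on one number per modality.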

Sadlier and O’Connor [118] extracted several audio-visual
features for event analysis in sports videos. The features were
combined by early fusion. Sun et al. [133] extracted MFCC, SIFT,
and HOG features for Web video categorization. Both early
and late fusion were evaluated, and their results did not show
a clear winner between the two fusion strategies. Since the early
concatenation of features may amplify the “curse of dimensionality”
problem, late fusion has been frequently adopted
for multimodal event recognition in unconstrained videos
[48,57,58,96]. SIFT, STIP, and MFCC features were combined
by late fusion by Jiang et al. [58] in their TRECVID 2010 MED
system. Late fusion of a similar set of features was also used by
Inoue et al. [48] for high-level event recognition.

In  late  fusion,  the  selection  of  suitable  fusion  weights 
is  important.  Equal  weights  (average  fusion)  were  used  in 
[57,58],  while  [48,96]  adopted  cross  validation  to  select  data 
adaptive  weights  optimized  for  different  events  and  specific 
datasets.  Using  weighted  late  fusion,  excellent  event  recog¬ 
nition  results  were  achieved  by  [48,96]  in  the  MED  task  of 
NIST TRECVID 2011 [99]. It is worth noting that late fusion
with adaptive weights generally outperforms average fusion
when training and test data follow a similar distribution. In case
there is a domain change between training and test videos,
the  weights  learned  from  cross  validation  on  the  training  set 
may  not  generalize  well  to  the  test  set.  Therefore,  the  choice 
of  fusion  strategies  depends  on  specific  application  problem 
domains.  In  broad-domain  Internet  video  analysis,  adaptive 
weights  are  expected  to  be  more  effective  according  to  the 
conclusions  from  recent  developments. 
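
A simple way to obtain data-adaptive weights for two modalities is a grid search on held-out data, sketched below. This is an assumption-laden illustration and not the procedure of [48,96]: it uses a single validation split instead of full cross validation, classification accuracy at a fixed threshold instead of a ranking metric, and hypothetical function names.

```python
def accuracy(fused, labels, threshold=0.5):
    """Fraction of samples whose thresholded fused score matches the binary label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(fused, labels))
    return hits / len(labels)

def select_weight(scores_a, scores_b, labels, steps=10):
    """Grid-search w in [0, 1] so that w * a + (1 - w) * b is most accurate.

    Returns (best weight, best accuracy); ties keep the smallest weight.
    """
    best_w, best_acc = 0.5, -1.0
    for k in range(steps + 1):
        w = k / steps
        fused = [w * a + (1 - w) * b for a, b in zip(scores_a, scores_b)]
        acc = accuracy(fused, labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

Because the weight is tuned on data drawn from the training domain, it inherits the limitation noted above: if the test videos come from a different domain, the selected weight may no longer be appropriate.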

In  addition  to  early/late  fusion,  Tsekeridou  and  Pitas  [138] 
took a different approach which combines audio-visual cues
using  interaction  rules  (e.g.,  person  X  talking  in  scene  Y)  for 
broadcast  news  video  analysis.  Duan  et  al.  [31]  used  multiple 
kernel  learning  (MKL),  which  combines  multimodal  features 
at  kernel  level,  for  recognizing  events  in  Internet  videos.  They 
also  proposed  a  domain  adaptive  extension  of  MKL,  to  deal 
with  data  domain  changes  between  training  and  test  data, 
which  often  occur  in  Internet  scale  applications.  MKL  was 
also  adopted  in  [132]  to  combine  multiple  spatio-temporal 
features  computed  on  local  patch  trajectories. 

Note  that  the  fusion  techniques  discussed  above  are  not 
restricted  to  any  particular  type  of  classifier.  They  can  be 
combined  across  completely  different  classification  strate¬ 
gies.  For  example,  classifier  confidences  obtained  from  SVM 
using  low-level  feature  representations  can  be  fused  with 
those obtained from other high-level classifiers (e.g., HMMs,
DBNs,  etc.)  based  on  a  completely  different  representation. 
The  only  issue  that  needs  to  be  addressed  while  fusing  outputs 
of  classifier  responses  is  that  the  classifier  outputs  need  to  be 
in  the  same  confidence  space.  To  deal  with  scale  variations 
commonly  seen  in  prediction  scores  from  different  classi¬ 
fiers,  Ye  et  al.  [169]  proposed  a  rank-based  fusion  method 
that  utilizes  rank  minimization  and  sparse  error  models  to 
recover  common  rank  orders  of  results  produced  by  multiple 
classifiers. 
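
The need for a common confidence space can be illustrated with a much simpler device than the method of [169]: replacing each classifier's raw scores by their normalized ranks before averaging. The sketch below is explicitly not the rank minimization approach of Ye et al.; it only shows why rank information is insensitive to the scale of the raw scores. The function names and example scores are assumptions.

```python
def rank_normalize(scores):
    """Map scores to (0, 1] by rank, so only the ordering within a classifier matters.

    The lowest score maps to 1/n and the highest to 1; ties are broken by position.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position / len(scores)
    return ranks

def fuse_by_rank(score_lists):
    """Average the rank-normalized scores of several classifiers, video by video."""
    normalized = [rank_normalize(s) for s in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]
```

With this normalization, SVM margins such as [2.1, -0.3, 0.4] and HMM log-likelihoods such as [-120.0, -450.0, -200.0] for the same three videos become directly comparable, even though their raw ranges differ by orders of magnitude.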

Summary  Current  research  in  this  direction  is  mostly  lim¬ 
ited  to  straightforward  approaches  of  early  or  late  fusion. 
However,  to  design  a  robust  system  for  high-level  event 
recognition,  fusion  techniques  play  an  extremely  important 
role.  As  research  strives  towards  more  efficient  multimodal 
representation  of  videos,  the  study  of  better  fusion  techniques 
is  expected  to  gather  momentum. 

4  Application  requirements 

In  this  section,  we  discuss  several  issues  that  have  emerged 
due  to  application  requirements,  including  event  localization 
and  recounting,  and  scalable  techniques  which  are  key  to 
Internet  scale  processing. 

4.1  Event  localization  and  recounting 

As  discussed  earlier,  most  works  view  event  recognition  as 
a classification process that assigns an input video a binary
or a soft probability label according to the presence of each
event.  However,  many  practical  applications  demand  more 
than  video-level  event  classification.  Two  important  prob¬ 
lems  that  will  significantly  enhance  fine-grained  video  analy¬ 
sis  are  spatial-temporal  event  localization  and  textual  video 
content  recounting.  The  former  tries  to  identify  the  spatial- 
temporal  boundaries  of  an  event,  while  the  latter  aims  at 
accurately describing video contents using concise natural
language. Technically, solutions to both problems may be
only one step ahead of video classification, but they are still
in their infancy, and may become mature with sufficient
effort in the next several years. We discuss them below.

4.1.1  Spatio-temporal  localization 

Direct  video-level  event  recognition  using  kernel  classifiers, 
where  it  is  assumed  that  a  video  has  already  been  temporally 
segmented into clips and the task of the classifier is to assign
each  clip  a  label,  has  been  extensively  studied.  However, 
locating  exactly  the  spatial-temporal  position  where  an  event 
happens  is  relatively  less  investigated.  One  reason  is  that — 
for  high-level  complex  events — it  is  difficult  to  define  precise 
temporal  boundaries,  let  alone  the  spatial  positions.  However, 
even  knowing  approximate  locations  could  be  beneficial  for 
applications  like  precise  video  search. 

Several efforts have been devoted to sports video event localization.
Zhang and Chang [173] developed a system for event
detection  in  baseball  videos.  Events  can  be  identified  based 
on  score  box  detection  and  OCR  (Optical  Character  Recog¬ 
nition).  Since  a  baseball  event  follows  a  strict  temporal  order 
(e.g.,  it  begins  with  a  pitching  view  and  ends  with  a  non¬ 
active  view),  the  temporal  location  can  be  easily  detected. 
Xu  et  al.  [161]  proposed  to  incorporate  web-casting  text  into 
sports  video  event  detection  and  observed  significant  gain 
especially  for  the  cases  that  cannot  be  handled  by  audio¬ 
visual  features. 

Besides  sports  events,  several  methods  have  been  pro¬ 
posed  for  temporally  localizing  human  actions  in  videos 
or  spatially  detecting  objects  (e.g.,  “person”  and  “car”)  in 
images.  These  techniques  form  a  good  foundation  for  future 
research  of  high-level  event  localization.  Duchenne  et  al.  [32] 
used  movie  script  mining  to  automatically  collect  (noisy) 
training  samples  for  action  detection.  To  locate  the  temporal 
positions  of  human  actions,  they  adopted  a  popularly  used 
sliding  window  approach  by  applying  SVM  classification 
over  temporal  windows  of  variable  lengths.  Hu  et  al.  [46] 
employed  multiple  instance  learning  to  deal  with  spatial  and 
temporal  ambiguities  in  bounding-box-based  human  action 
annotations.  This  method  was  found  useful  when  the  videos 
are  captured  in  complex  scenes  (e.g.,  supermarket).  Simi¬ 
lar  to  the  idea  of  sliding  window  search,  Satkin  and  Hebert 
[120]  located  the  best  segment  in  a  video  for  action  training 
by  exhaustively  checking  all  possible  segments  of  the  video. 

Fig. 9 Video event recounting examples generated by the approach proposed by Tan et al. [134]. The one on the right is a failure case due to incorrect prediction of scene context (reprinted from [134], ©2011 ACM). Recountings shown in the figure: “The background is a baseball field. Someone is running on baseball field. Wow! The supporters are cheering. Applause! ...”; “Someone is making stuffs with hands outdoors. Several people are assembling a shelter. Someone is talking...”; “Someone is making stuff with hands in the kitchen. Someone is talking...”

Oikonomopoulos  et  al.  [102]  proposed  to  learn  action  spe¬ 
cific  codebooks,  where  each  codeword  is  an  ensemble  of  local 
features,  with  spatial  and  temporal  locations  recorded.  The 
learned  codebook  was  used  to  predict  the  spatial-temporal 
locations  of  an  action-of-interest  in  test  videos,  using  a  sim¬ 
ple  voting  scheme  with  Kalman  filter-based  smoothing. 
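
The sliding window search over temporal windows of variable lengths can be sketched as follows. In the works discussed above a classifier such as an SVM scores each candidate window; as a stand-in, this sketch assumes that per-frame scores are already available and rates a window by their mean, computed in constant time from prefix sums. The function name and the scoring rule are assumptions, and min_len is expected to be at least 1.

```python
def best_window(frame_scores, min_len, max_len):
    """Exhaustively search windows of min_len..max_len frames.

    Returns (start, end, mean score) of the best window, with `end` exclusive,
    or None if no window of the requested lengths fits in the sequence.
    """
    prefix = [0.0]
    for s in frame_scores:
        prefix.append(prefix[-1] + s)
    best = None
    for length in range(min_len, min(max_len, len(frame_scores)) + 1):
        for start in range(len(frame_scores) - length + 1):
            mean = (prefix[start + length] - prefix[start]) / length
            if best is None or mean > best[2]:
                best = (start, start + length, mean)
    return best
```

The two nested loops make the cost grow with the product of the sequence length and the number of window lengths, which is the computational burden of exhaustive search discussed in this section.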

Spatially  localizing  objects  in  images  has  been  extensively 
studied  in  the  literature.  One  seminal  work  is  the  Viola- 
Jones  real-time  object  detector  [148],  which  is  based  on  a 
boosted  cascade  learning  framework,  using  features  derived 
from  integral  images  that  can  be  computed  more  efficiently. 
Using  a  similar  framework,  Vedaldi  et  al.  [145]  integrated 
several  features  using  multiple  kernel  learning,  which  led  to 
one  of  the  best-performing  systems  in  the  object  detection 
task  of  2009  PASCAL  VOC  Challenge.  Among  many  other 
recent  efforts  on  object  detection,  a  representative  work  is 
by  Felzenszwalb  et  al.  [35],  who  used  deformable  part-based 
models  with  several  important  innovations  like  the  proposal 
of  a  latent  SVM  formulation  for  model  learning  and  strate¬ 
gies  for  selecting  hard  negative  examples  (those  which  are 
difficult to differentiate). This approach is now widely
adopted.3
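
The efficiency of integral-image features comes from the fact that the sum of any rectangular region can be read from four table entries. The sketch below shows this building block only, not the boosted cascade of [148]; the function names and the padding convention (one extra row and column of zeros) are choices made for this example.

```python
def integral_image(image):
    """Return ii with ii[r][c] equal to the sum of image rows < r and columns < c.

    `image` is a non-empty list of equally long rows of numbers; the result
    has one extra row and one extra column of zeros.
    """
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

Once the table is built in a single pass over the image, every rectangle feature costs the same four look-ups regardless of its size, which is what makes evaluating many such features per window affordable.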

Spatio-temporal  localization  of  concepts  is  helpful  for 
localizing  the  occurrence  of  high-level  events.  Since  con¬ 
cepts  tend  to  co-occur  spatio-temporally,  once  localization 
information  of  a  key  concept  (e.g.,  human  face)  is  available, 
the  probabilities  of  detection  of  other  co-occurring  concepts 
increase,  thereby  enhancing  the  overall  accuracy  of  event 
detection. However, this is a difficult task to achieve given
the current state of detectors. In addition to the difficulty of
the task, the exhaustive search using typical sliding-window-based
approaches adds to the computational complexity of the
detection algorithms, questioning their viability in practical
recognition tasks.

4.1.2  Textual  recounting 

Multimedia  event  recounting  (MER)  refers  to  the  task  of 
automatic  textual  explication  of  an  event  depicted  in  a  video. 


3 Source code from the authors of [35] is available at http://www.cs.brown.edu/~pff/latent/.


Recently,  NIST  introduced  the  MER  task4  whose  goal  is  to 
produce  a  recounting  that  summarizes  the  key  evidence  of  the 
detected  event.  Textual  human-understandable  descriptions 
of  events  in  videos  may  be  used  for  a  variety  of  applications. 
Beyond  more  precise  content  search,  event  recounting  can 
also  enhance  media  access  for  people  with  low  vision. 

Kojima  et  al.  [63]  developed  a  system  to  generate  nat¬ 
ural  language  descriptions  for  videos  of  human  activities. 
First,  head  and  body  movements  were  detected,  and  then  case 
frames  [37]  were  used  to  express  human  activities  and  gen¬ 
erate  natural  language  sentences.  Recently,  Tan  et  al.  [134] 
proposed  to  use  audio-visual  concept  detection  (e.g.,  “base¬ 
ball  field”  and  “cheering  sound”)  to  analyze  Internet  video 
contents.  The  concept  detection  results  are  converted  into  tex¬ 
tual  descriptions  using  rule-based  grammar.  Some  example 
results  from  their  method  are  shown  in  Fig.  9. 

Other  relevant  works  on  visual  content  recounting  include 
[36,44,105,167], all of which focused primarily on images,
however.  Yao  et  al.  [167]  explored  a  large  image  database 
with  region-level  annotations  for  image  parsing  and  results 
are  then  converted  to  textual  descriptions  using  a  natural  lan¬ 
guage  generation  (NLG)  technique  called  head-driven  phrase 
structure  grammar  [110].  Ordonez  et  al.  [105]  used  1  million 
Flickr  images  to  describe  images.  Their  method  is  purely 
data-driven,  i.e.,  a  query  image  is  described  using  descrip¬ 
tions  of  its  most  visually  similar  image  in  the  database.  In 
a  similar  spirit  to  [105],  Feng  and  Lapata  [36]  also  lever¬ 
aged  a  large  set  of  Internet  pictures  (mostly  with  captions), 
to  automatically  generate  headline-like  captions  for  news 
images.  In  order  to  produce  short  headline  captions,  they 
adopted  a  well-known  probabilistic  model  of  headline  gen¬ 
eration  [9]. 

If events could be perfectly recognized, recounting might
become an easier problem, since incorporating knowledge
from text tags of similar videos would probably work. A more
sophisticated approach is to employ hierarchical concept-based
classification as discussed in Sect. 3.1.2, which can provide
key evidence in support of the detected events to perform
recounting. This can provide additional information for
recounting, in particular when used in conjunction with a
model that captures the temporal dependency across concepts
(e.g., HMMs). A reasonable recounting for a complex event
such as “baking a cake” can be exemplified as follows, with
the key evidence italicized for legibility: This video is shot
in an indoor kitchen environment. A person points a finger at
the ingredients. Then he mixes the ingredients in a bowl using
a blender whose noise is heard in the background. After that
he puts the bowl in a conventional oven. Finally he takes the
bowl and puts it on a table.

4 http://www.nist.gov/itl/iad/mig/mer.cfm.

4.2  Scalability  and  efficiency 

Speed is always a challenge when dealing with Internet-scale
data. Current video event recognition research normally deals
with only a few hundred to tens of thousands of videos. To
scale up, extra care needs to be taken in choosing features
that are efficient to extract, recognition models that are
fast, and a system architecture suitable for parallel
computing. In the following we briefly introduce several
representative works on each of these aspects.

For features, the SURF descriptor [12] was proposed as
a fast alternative to SIFT [88]. Knopp et al. [62] extended
SURF to efficiently compute 3D spatio-temporal keypoints.
Further, several works have reported that dense sampling,
which uniformly selects local 2D image patches or 3D video
volumes, can be adopted in place of the expensive sparse
keypoint detectors (e.g., DoG [88]) with competitive
recognition performance [101]. Uijlings et al. [142] observed
that the dense SIFT and dense SURF descriptors can be
computed more efficiently with careful implementations that
avoid repetitive computations of pixel responses in
overlapping regions of nearby image patches.

The quantization or word assignment process in the BoW
representation [127] is computationally expensive when using
brute-force nearest neighbor search. Nister et al. [100]
showed that quantization can be executed very efficiently
if words in the vocabulary are organized in a tree structure.
Moosmann et al. [93] adopted random forests, collections
of binary decision trees, for fast quantization. Shotton et al.
[124] proposed semantic texton forests (STFs) as an
alternative image representation. STFs are ensembles of decision
trees that work directly on image pixels, and therefore can
be computed efficiently since they do not require expensive
local keypoint detection and description. Yu et al. [170]
further extended STFs for efficient 3D spatio-temporal human
action representation.
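To make the cost of brute-force word assignment concrete, the following is a minimal sketch in plain Python. The function names (`squared_distance`, `assign_word`, `bow_histogram`) and the toy data are our own illustrative assumptions, not code from any of the cited works. Each descriptor is compared against every visual word, so the cost grows with the product of the number of descriptors and the vocabulary size; the tree- and forest-based methods above aim to shorten exactly this inner search.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_word(descriptor, vocabulary):
    """Brute-force assignment: index of the nearest visual word."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """L1-normalized bag-of-words histogram over the vocabulary."""
    if not descriptors:
        return [0.0] * len(vocabulary)
    counts = Counter(assign_word(d, vocabulary) for d in descriptors)
    return [counts[k] / len(descriptors) for k in range(len(vocabulary))]


# Toy example with a two-word vocabulary in a 2D descriptor space.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 9.0), (0.0, 1.0), (11.0, 10.0)]
histogram = bow_histogram(descriptors, vocabulary)
```

In practice descriptors are high-dimensional (e.g., 128 dimensions for SIFT) and vocabularies contain thousands of words, which is why the linear scan in `assign_word` becomes the bottleneck.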

Fig. 10 Illustration of a typical MapReduce process

As introduced earlier, currently the most popular recognition
method is the SVM classifier. The classification process
of SVM can be slow when nonlinear kernels such as
histogram intersection and χ² are adopted. Maji et al. [81]
proposed an interesting idea, with which the histogram
intersection and χ² kernels can be computed with logarithmic
complexity in the number of support vectors. Uijlings et al.
[142] tested this method on video concept detection tasks
and observed satisfying performance in both precision and
speed. Recently, Jiang [53] conducted an extensive evaluation
of the efficiency of features and classifier kernels in
video event recognition. The fast histogram intersection
kernel was reported to be reliable and efficient. In addition,
the simple and efficient linear kernel was shown to be effective
on high-dimensional feature representations like the Fisher
vectors [108].
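For reference, the kernels mentioned above can be written down in a few lines. This is a hedged sketch: the exponential form of the χ² kernel and the `gamma` parameter are common conventions that we assume here, and the function names are illustrative. The sketch shows the direct evaluation between two histograms, not the accelerated evaluation of [81].

```python
import math


def linear_kernel(h1, h2):
    """Dot product of two histograms."""
    return sum(a * b for a, b in zip(h1, h2))


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def chi_square_distance(h1, h2):
    """Chi-square distance; bins that are empty in both histograms are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)


def chi_square_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel (assumed form: exp(-gamma * distance))."""
    return math.exp(-gamma * chi_square_distance(h1, h2))


h1 = [0.5, 0.5]
h2 = [0.25, 0.75]
intersection = histogram_intersection(h1, h2)
distance = chi_square_distance(h1, h2)
```

Evaluating an SVM decision function with such kernels naively requires one kernel computation per support vector, which is the cost that the fast approximations discussed above reduce.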

Learning and inference algorithms for graphical models have
also been extensively investigated in the machine learning
and pattern recognition fields. Frey and Jojic [40] evaluated
several popular inference and learning algorithms of
graph-based probability models in vision applications.
De Campos and Ji [18] proposed an efficient algorithm which
integrates several structural constraints for learning
Bayesian networks.

Parallel computing is very important for large-scale data
processing. Video event recognition is not a difficult task to
split and run on multiple machines in parallel, as there
could be many event categories, and each may be handled
by one computer (node). In addition, test videos can
also be processed independently. MapReduce is probably the
most popular framework for processing such a highly
distributable problem. In MapReduce, a task is a basic computation
unit such as classifying a video clip using an SVM model.
Figure  10  depicts  a  general  MapReduce  process.  The  “Map” 
step  employs  a  master  node  to  partition  the  problem  into 
several  tasks  and  distribute  them  to  worker  nodes  for  com¬ 
putation.  In  the  “Reduce”  step,  one  or  multiple  worker  nodes 
take  the  results  and  consolidate  them  to  form  the  final  out¬ 
put,  which  should  be  the  same  as  running  the  entire  problem 
on  a  single  node.  Several  works  have  discussed  the  use  of 
MapReduce in video processing. Yan et al. [165] adopted
MapReduce for large-scale video concept recognition. They
proposed a task scheduling algorithm specifically tailored
for the concept detection problem where the execution
time varies across different tasks (e.g., classifying a video
using SVM models with different numbers of support
vectors). The algorithm estimates the computational time of
each task a priori, which effectively reduces system idle
time. White et al. [157] discussed MapReduce implementations
of several popular algorithms in computer vision and
multimedia problems (e.g., classifier training, clustering, and
bag-of-features).

Table 1 Overview of TRECVID MED 2010-2011 [99] and CCV [57] datasets

Dataset  | # Training/test videos | # Classes | # Positive videos per class | Average duration (s) | Format | File size (GB)
---------+------------------------+-----------+-----------------------------+----------------------+--------+---------------
MED 2010 | 1,746/1,741            | 3         | 89                          | 119                  | mp4    | 38
MED 2011 | 13,115/32,061          | 10        | 253                         | 114                  | mp4    | 559
CCV      | 4,659/4,658            | 20        | 394                         | 80                   | flv    | 30

The TRECVID videos are available upon participation in the benchmark evaluation, while the CCV dataset is publicly
available. For all three datasets, the positive videos are evenly distributed in the training and test sets

Another possible way to parallelize event recognition
algorithms is to use tightly coupled computational frameworks
(in which computational modules actively communicate with
each other) such as the message passing interface (MPI).5 This
approach, although more efficient, requires a total
algorithmic redesign and imposes a steep learning curve on
multimedia and computer vision researchers. Therefore, a more
practical solution is to use the MapReduce framework or other
closely related approaches such as the unstructured information
management architecture (UIMA).6 Since these approaches
follow a loosely coupled computational paradigm where
modules do not need to actively communicate with each
other, they are expected to be favored by practitioners
in the long run.
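The loosely coupled Map and Reduce steps described in this section can be illustrated with a small, single-machine sketch. Everything here is an assumption made for illustration: local threads stand in for worker nodes, a linear scoring function stands in for an SVM model, and the names `score_video`, `reduce_scores`, and `run_job` are hypothetical. A real deployment would use a MapReduce framework rather than a thread pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def score_video(task):
    """Map task: score one video against one event model.

    A linear model is a stand-in for an SVM classifier (assumption).
    """
    video_id, feature, event, weights = task
    score = sum(w * x for w, x in zip(weights, feature))
    return event, (video_id, score)


def reduce_scores(mapped):
    """Reduce step: group scores by event and rank videos per event."""
    grouped = defaultdict(list)
    for event, item in mapped:
        grouped[event].append(item)
    return {event: sorted(items, key=lambda it: it[1], reverse=True)
            for event, items in grouped.items()}


def run_job(videos, models, workers=4):
    """Partition the problem into independent (video, event) tasks."""
    tasks = [(vid, feat, event, weights)
             for vid, feat in videos.items()
             for event, weights in models.items()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(score_video, tasks))
    return reduce_scores(mapped)


videos = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
models = {"parade": [1.0, 0.0], "parkour": [0.0, 2.0]}
ranked = run_job(videos, models)
```

Because each task touches only its own video and model, no communication between tasks is needed, and the consolidated output is the same as running all tasks sequentially on one node.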

5  Evaluation  benchmarks 

Standard datasets for human action recognition research
include those captured under constrained environments like
KTH [121], Weizmann [14], and IXMAS [155], and several more
realistic ones such as UCF11 [74], UCF Sports [114], the UCF50
action dataset [143], the Hollywood Movie dataset [66],
and the more recently released Human Motion Database
(HMDB) [64]. These benchmark datasets have played a
very important role in advancing the state of the art in
human action analysis. In this section, we discuss evaluation
benchmarks for high-level event recognition in
unconstrained videos.


5 http://www.mcs.anl.gov/research/projects/mpi/.

6  http://uima.apache.org/. 


5.1  Public  datasets 

TRECVID MED datasets [99] Motivated by the need to
analyze complex events in Internet videos, the annual NIST
TRECVID [128] activity defined a new task in 2010 called
multimedia event detection (MED). Each year a new (or an
extended) dataset is created for cross-site system comparison.
Table 1 summarizes the 2010 and 2011 editions of the TRECVID
MED datasets. The MED data consist of user-generated content
from Internet video hosting sites, collected and annotated
by the Linguistic Data Consortium (LDC).7 Figure 11 gives
an example for each event class. In MED 2010, only three
events were defined, all of which are long-term procedures.
The number of classes increased to 15 in the much larger
MED 2011 dataset. Out of the 15 classes, 5 are annotated
only on the training set for system development (e.g., feature
design and parameter tuning), and the remaining 10 are used
in the official evaluation. Besides several procedural events,
there are also a few social activity events included in 2011,
e.g., “wedding ceremony” and “birthday party”. The current
editions of MED data contain only binary event annotations
at the video level, and the MED task focuses only on
video-level event classification.

Columbia consumer video (CCV) dataset [57] The CCV
dataset was collected in 2011 to stimulate research on Internet
consumer video analysis.8 Consumer videos are captured by
ordinary consumers without professional post-editing. They
contain very interesting and diverse content, and account for a
large portion of Internet video sharing activities (most of the
MED videos are also consumer videos). A snapshot of the
CCV dataset can be found in Table 1. Twenty classes are defined,
covering a wide range of topics including objects (e.g., “cat”
and “dog”), scenes (e.g., “beach” and “playground”), sports
events (e.g., “baseball” and “skiing”), and social activity
events (e.g., “graduation” and “music performance”). Class
annotations in CCV were also performed at the video level. The
classes were defined according to the Kodak consumer video
concept ontology [76]. The Kodak ontology contains over
100 concept definitions based on rigorous user studies to
evaluate the usefulness and observability (popularity) of each
concept found in actual consumer videos.

7 http://www.ldc.upenn.edu/.

8 Download site: http://www.ee.columbia.edu/dvmm/CCV/.

Fig. 11 Examples of TRECVID MED 2010 and 2011 events. In 2011, in addition to the 10 events used for official evaluation, TRECVID also defined
5 events for system development (e.g., parameter tuning). MED 2010 events: Assembling a shelter; Batting a run in; Making a cake. MED 2011
development events: Attempting a board trick; Feeding an animal; Landing a fish; Wedding ceremony; Working on a woodworking project.
MED 2011 testing events: Birthday party; Changing a vehicle tire; Flash mob gathering; Getting a vehicle unstuck; Grooming an animal;
Making a sandwich; Parade; Parkour; Repairing an appliance; Working on a sewing project

Kodak consumer video dataset [76] Another dataset for
unconstrained video analysis is the Kodak consumer video
benchmark [76]. The Kodak consumer videos were collected
by around 100 customers of Eastman Kodak Company. There
are 1,358 video clips labeled with 25 concepts (a part of
the Kodak concept ontology). Compared to the MED and CCV
datasets, one limitation of the Kodak dataset is that there is
not enough intra-class variation. Many videos were captured
under similar scenes (e.g., many “picnic” videos were taken
at the same location), which makes this dataset vulnerable to
over-fitting.

There are also a few other datasets for unconstrained video
analysis, e.g., LabelMe Video [172] and MCG-WEBV [19].
LabelMe Video is built upon the LabelMe image annotation
platform [116]. An online system lets Internet users label
not only event categories but also outlines and
spatial-temporal locations of moving objects. The granularity
of annotations is very suitable for finer-grained event
recognition. However, since the labeling process is time-consuming
and unpaid, the amount of collected annotations depends on
highly motivated users. So far the annotations in LabelMe
Video are quite limited in both scale and class diversity, and
there is no video suitable for high-level event analysis.
MCG-WEBV is a large set of YouTube videos organized by the
Chinese Academy of Sciences. The current version of MCG-WEBV
contains 234,414 videos, with annotations on several
topic-level events like “a conflict at Gaza”, which are too
complicated and diverse to be recognized by content analysis
alone. Existing works using this dataset are mostly for video
topic tracking and documentation, by exploiting textual
contexts (e.g., tags and descriptions) and metadata (e.g.,
video uploading time).

The availability of annotated data for training classifiers
for event detection is a vital challenge. Recently,
crowdsourcing efforts using the LabelMe toolkits [116, 172] and
the more general Amazon Mechanical Turk (AMT) platform9 (used
in the annotation of the CCV dataset) have been used
extensively to annotate videos and images manually in a more
efficient manner. Crowdsourcing is expected to gain more
popularity as researchers become aware of these tools.

5.2  Performance  metrics 

Event recognition results can be measured in various ways,
depending on the application requirements. We first consider
the simplest and most popular case, where the determination of
event presence is at the entire video level. This is essentially
a classification problem: given an event of interest, a
recognition system generates a confidence score for each input video.

9 http://www.mturk.com/.

Average precision (AP) and normalized detection cost
(NDC) are the most widely used measurements of video event
recognition performance. The input to both AP and NDC is
a video list ranked according to the recognition confidence
scores. We briefly introduce each of them in the following.
In addition to AP and NDC, metrics based on detection-error
tradeoff (DET) curves have recently been used to evaluate the
performance of event detection. The DET curves, as the name
indicates, are generated from the probabilities of miss
detection and false alarm produced by a given classifier.

Average precision AP is a single-valued measurement
approximating the area under a precision-recall curve, which
reflects the quality of the ranking of test videos (according
to classification probability scores). Denote $R$ as the number
of truly relevant videos in a target dataset. At any index $j$,
let $R_j$ be the number of relevant videos in the top $j$ list.
AP is defined as

$$\mathrm{AP} = \frac{1}{R} \sum_{j} I_j \times \frac{R_j}{j},$$

where $I_j = 1$ if the $j$th video is relevant and 0 otherwise. AP
favors highly ranked relevant videos. It returns a full score
(AP = 1) when all the relevant videos are ranked on top of
the irrelevant ones.
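As a worked illustration of the definition above, the sketch below computes AP from a ranked list of binary relevance labels. It assumes the standard form in which the precision at each relevant rank, $R_j / j$, is summed and divided by $R$; the function name and the optional `num_relevant` argument are our own choices.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP of a ranked list, where ranked_relevance[j-1] is 1 if the
    j-th ranked video is relevant and 0 otherwise.

    num_relevant is R, the number of relevant videos in the dataset;
    by default it is taken to be the number of relevant items in the list.
    """
    total_relevant = (sum(ranked_relevance)
                      if num_relevant is None else num_relevant)
    if total_relevant == 0:
        return 0.0
    hits = 0          # R_j: relevant videos seen in the top j
    accumulated = 0.0
    for j, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            accumulated += hits / j
    return accumulated / total_relevant


perfect = average_precision([1, 1, 0, 0])
mixed = average_precision([1, 0, 1])
```

For the list `[1, 0, 1]`, the precisions at the two relevant ranks are 1/1 and 2/3, so AP is their mean, 5/6.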

Normalized detection cost Figure 12 illustrates the basic
concepts involved in computing the NDC, which is the official
performance metric of the TRECVID MED task [38].
Different from AP, which evaluates the quality of a ranked list,
NDC requires a recognition threshold. Videos with confidence
scores above the threshold are considered relevant (i.e.,
the relevant set in the figure). Specifically, given a
recognition threshold, we first define $P_{MD}$ (miss detection
rate) and $P_{FA}$ (false alarm rate):

$$P_{MD} = \frac{\#\text{misses}}{\#\text{targets}}, \qquad P_{FA} = \frac{\#\text{false alarms}}{\#\text{total videos} - \#\text{targets}},$$

where #targets is the total number of videos containing the
target event in a dataset. With $P_{MD}$ and $P_{FA}$, NDC can be
computed as:

$$\mathrm{NDC} = \frac{C_{MD} \times P_{MD} \times P_T + C_{FA} \times P_{FA} \times (1 - P_T)}{\min\left(C_{MD} \times P_T,\; C_{FA} \times (1 - P_T)\right)},$$

where $P_T$ is the prior probability of the event (i.e.,
$P_T = \#\text{targets} / \#\text{total videos}$), and $C_{MD}$ and $C_{FA}$ are
positive cost parameters to weigh the importance of $P_{MD}$ and
$P_{FA}$, respectively.

Fig. 12 An illustration of the terminologies used to compute NDC (legend: false negatives (FN) / miss detections (MD); true negatives (TN); targets = TP + MD; false alarm rate = FA/(FA + TN))

As can be seen, NDC uses two cost parameters to weigh
the importance of the miss detection rate and the false alarm
rate. As a result, NDC provides a more flexible way than AP to
evaluate recognition performance. Unlike AP, a lower
NDC value indicates better performance. Based on NDC,
NIST uses two variants to measure the performance of MED
systems, namely ActualNDC and MinimalNDC. ActualNDC
is computed at the threshold that the participants provide
based on their algorithms, while MinimalNDC is computed at the
optimal threshold, i.e., the threshold that leads to the
minimum NDC value on a ranked list. MinimalNDC is adopted
as an additional measurement to ActualNDC since the latter
is sensitive to the automatically predicted threshold.
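The NDC computation and the difference between a submitted threshold and the optimal one can be sketched as follows. This is an illustrative implementation under our own assumptions: a video is treated as detected when its score is greater than or equal to the threshold, the prior $P_T$ is estimated from the labels as described in the text, and the default costs of 1.0 are placeholders rather than the values used in the official evaluation.

```python
def detection_rates(scores, labels, threshold):
    """Return (miss detection rate, false alarm rate) at a threshold."""
    targets = sum(labels)
    non_targets = len(labels) - targets
    if targets == 0 or non_targets == 0:
        raise ValueError("need at least one target and one non-target")
    misses = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels)
                       if not y and s >= threshold)
    return misses / targets, false_alarms / non_targets


def normalized_detection_cost(p_md, p_fa, p_target, c_md=1.0, c_fa=1.0):
    """NDC from the two error rates, the event prior, and the two costs."""
    cost = c_md * p_md * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost / min(c_md * p_target, c_fa * (1.0 - p_target))


def minimal_ndc(scores, labels, c_md=1.0, c_fa=1.0):
    """Smallest NDC over all thresholds induced by the ranked list."""
    p_target = sum(labels) / len(labels)
    # One extra threshold above the top score covers "detect nothing".
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    return min(
        normalized_detection_cost(*detection_rates(scores, labels, t),
                                  p_target, c_md, c_fa)
        for t in thresholds)


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
p_md, p_fa = detection_rates(scores, labels, threshold=0.85)
actual = normalized_detection_cost(p_md, p_fa, p_target=0.5)
best = minimal_ndc(scores, labels)
```

In this toy example the ranking is perfect, so the minimal NDC is 0, while the poorly chosen threshold of 0.85 misses one of the two targets and yields an NDC of 0.5. This gap is the sensitivity to the predicted threshold noted above.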

Partial area under DET curve The DET curve, introduced
by Martin et al. [84], is often used for evaluating detection
performance where the number of negative samples is
significantly larger than that of the positive ones. The curve
is generated by plotting false alarm rate versus miss detection
rate after scaling the axes non-linearly by their standard
normal deviates. In order to quantitatively evaluate the
performance of a classifier, the area under the DET curve can be
used as a single-value metric, where a smaller area indicates
better classifier performance. However, the whole area under
the curve may not be meaningful, which is why only the portion
of the curve within a predefined operating region is considered.
Figure 13 illustrates the idea of using the partial area under
the DET curve as a metric, with an operating region bounded by
60 % miss detection and 5 % false alarm.
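The normal-deviate scaling of the DET axes can be sketched with the Python standard library. This is a minimal illustration under our own assumptions: one operating point is generated per distinct score, detection uses a greater-than-or-equal comparison, and rates of exactly 0 or 1 are clipped by a small `eps` because the inverse normal CDF is undefined there. The partial-area computation itself is not shown.

```python
from statistics import NormalDist


def clip_probability(p, eps):
    """Keep p strictly inside (0, 1) so the inverse CDF is defined."""
    return min(max(p, eps), 1.0 - eps)


def det_points(scores, labels, eps=1e-6):
    """Return (false alarm deviate, miss deviate) pairs, one per threshold."""
    targets = sum(labels)
    non_targets = len(labels) - targets
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    points = []
    for threshold in sorted(set(scores)):
        p_md = sum(1 for s, y in zip(scores, labels)
                   if y and s < threshold) / targets
        p_fa = sum(1 for s, y in zip(scores, labels)
                   if not y and s >= threshold) / non_targets
        points.append((normal.inv_cdf(clip_probability(p_fa, eps)),
                       normal.inv_cdf(clip_probability(p_md, eps))))
    return points


scores = [0.9, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0]
points = det_points(scores, labels)
```

On these axes a rate of 50 % maps to a deviate of 0, and rates near 0 or 1 are stretched, which spreads out the low-error region that matters when negatives greatly outnumber positives.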

Fig. 13 An illustration of metric selection in a detection error tradeoff
curve (horizontal axis: false alarm probability, in %)

Spatio-temporal localization Unlike video-level
classification, spatio-temporal localization demands an
evaluation measure that works at a finer resolution. Prior works
on spatial [145] and temporal [32] localization are also
evaluated by average precision (AP). Taking temporal event
localization as an example [32], systems return a list of video clips
with variable durations (instead of a list of videos), ranked
by the likelihood of fully containing the target event with no
redundant frames. A clip is treated as a correct hit if it overlaps
with a ground-truth event over a certain percentage (normally
50 %). With this judgment, AP can be easily applied over the
ranked list of video clips. Similarly, this can be extended to
spatial localization and joint spatial-temporal localization.
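The hit judgment for temporal localization can be sketched as below. The text does not specify how the overlap percentage is measured, so this example assumes intersection over union of the two time intervals, and it additionally assumes that each ground-truth event can be matched by at most one returned clip. Both choices, and the function names, are ours.

```python
def temporal_iou(clip, truth):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(clip[1], truth[1]) - max(clip[0], truth[0]))
    union = (clip[1] - clip[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def label_hits(ranked_clips, ground_truth, min_overlap=0.5):
    """Mark each ranked clip as a hit (1) or a miss (0).

    A ground-truth event is consumed by the first clip that matches it,
    so duplicate detections of the same event count as misses (assumption).
    """
    matched = set()
    hits = []
    for clip in ranked_clips:
        hit = 0
        for k, truth in enumerate(ground_truth):
            if k not in matched and temporal_iou(clip, truth) >= min_overlap:
                matched.add(k)
                hit = 1
                break
        hits.append(hit)
    return hits


ranked_clips = [(0.0, 10.0), (4.0, 12.0), (40.0, 50.0)]
ground_truth = [(2.0, 10.0)]
hits = label_hits(ranked_clips, ground_truth)
```

The resulting binary list plays the same role as the per-video relevance labels in video-level evaluation, so AP can be computed over it directly.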

Multimedia event recounting A more challenging
problem is the evaluation of event recounting, which is very
difficult to assess quantitatively. Most works on textual
recounting used subjective user studies [105, 134]. A number of
criteria such as completeness and redundancy are defined, and
users are asked to score each criterion. Some quantitative
measures were also used as a complement to the subjective
user evaluation. The BLEU score [106], which
is very popular for evaluating machine translation quality,
was adopted by Ordonez et al. [105] to measure the quality
of image captioning. Such a quantitative criterion, however,
cannot really measure the consistency between the semantic
meanings of the machine-generated sentences and the video
content.
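To show what a BLEU-style score actually measures, the following is a simplified sentence-level sketch. It is not the exact evaluation protocol of any cited work: we assume a single reference sentence, n-grams up to length 2, uniform weights, and no smoothing, and the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "a person mixes the ingredients in a bowl".split()
candidate = "a person mixes the ingredients in a bowl".split()
score = bleu(candidate, reference)
```

The score depends only on surface n-gram overlap, which is consistent with the limitation noted above: a fluent sentence that describes the wrong content can still share many n-grams with the reference.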

5.3  Forums  and  recent  approaches 

A few forums have been set up to stimulate video content
recognition, e.g., NIST TRECVID [128] and MediaEval
[86]. In this section, we focus our discussion on the annual
NIST TRECVID evaluation [128], since it is to our knowledge
the only forum that fully focuses on video analysis and
has made a consistently high impact over the years. NIST
defines several tasks every year, focusing on various issues
in video retrieval. Among them, MED is the task evaluating
systems for high-level event recognition in unconstrained
videos. Initiated in 2010, MED has already helped
advance the state of the art.

The number of teams participating in the MED task
increased quickly from 7 (2010) to 19 (2011). We summarize
all the submitted results in Fig. 14. Each team may submit
multiple results to test the effectiveness of various
combinations of system modules. The number of submitted results
per team was limited to 4 in 2011. Such a limitation did not
exist in 2010, and thus the number of submitted results did
not increase at the same pace as the number of teams. In
terms of mean MinimalNDC over the evaluated event classes,
we see significant improvements in MED 2011 compared
to the previous year. However, it is important to note that
results are not directly comparable across years due
to the changes in video data and event classes.

Fig. 14 Performance of TRECVID MED 2010 and 2011 submissions,
measured using mean MinimalNDC over all the evaluated events.
(a) MED 2010, 45 submissions from 7 teams. (b) MED 2011, 60
submissions from 19 teams. There are 3 test events in MED 2010
and 10 test events in MED 2011

We briefly discuss the techniques of the teams that produced
top-performing results. In 2010, the Columbia-UCF
joint team [58] achieved the best performance using a
framework combining multiple modalities, concept-level context
(based on 21 scene/action/audio concept classifiers), and
temporal matching techniques. Three audio-visual features
(SIFT [77], STIP [65], and MFCC) were extracted and converted
into bag-of-words representations. An SVM classifier
was adopted to train models separately using each feature,
and results were combined by average late fusion.
Specifically, for SVM they used both the standard χ² kernel and the
EMD (earth mover’s distance) kernel [162], where the latter
was applied to alleviate the effect of event temporal
misalignment. One important observation in [58] is that the three
audio-visual features are highly complementary. While
temporal matching with the EMD kernel led to a noticeable gain
for some events, the concept-level context did not show clear
improvements.
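Average late fusion, as used in the system described above, combines per-feature classifier outputs at the score level. The sketch below is a generic illustration with made-up scores; the function name, the optional `weights` argument, and the assumption that all classifiers output scores on a comparable scale are ours and are not taken from [58].

```python
def late_fusion(scores_by_feature, weights=None):
    """Weighted average of per-feature scores for each video.

    scores_by_feature maps a feature name to {video_id: score}.
    With weights=None every feature contributes equally (average fusion).
    """
    names = sorted(scores_by_feature)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    video_ids = scores_by_feature[names[0]].keys()
    return {
        video: sum(weights[name] * scores_by_feature[name][video]
                   for name in names) / total_weight
        for video in video_ids
    }


# Hypothetical classifier outputs for two videos and three features.
scores_by_feature = {
    "sift": {"v1": 0.9, "v2": 0.2},
    "stip": {"v1": 0.5, "v2": 0.4},
    "mfcc": {"v1": 0.1, "v2": 0.9},
}
fused = late_fusion(scores_by_feature)
weighted = late_fusion(scores_by_feature,
                       weights={"sift": 2.0, "stip": 1.0, "mfcc": 1.0})
```

Passing non-uniform weights turns the same function into weighted average fusion, where the weights would typically be tuned on held-out data.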

In MED 2011, the best results were achieved by a large
collaborative team named VISER [96]. Many features were
adopted in the VISER system, such as SIFT [77], Color-SIFT
[119], SURF [12], HOG [27], MFCC, Audio Transients [26],
etc. Similar to [58], the bag-of-words representation was used
to convert each of the feature sets into a fixed-dimensional
vector. A joint audio-visual bi-modal representation [168]
was also explored, which encodes local patterns across the
two modalities. Different fusion strategies were used: a fast
kernel-based method for early fusion, a Bayesian model
combination for optimizing performance at a specific operating
point, and weighted average fusion for optimal performance
over the entire performance curve. In addition, they also
utilized other ingredients like object and scene level concept
classifiers (e.g., the models provided in [136]), automatic
speech recognition (ASR), and OCR. Their results showed
that the audio-visual features, including the bi-modal
representation, are very effective. The concept classifiers and
the fusion strategies also offered some improvements, but
the ASR and OCR features were less helpful, perhaps due to
their low occurrence frequencies in this specific dataset.


6  Future  directions 

Although  significant  efforts  have  been  devoted  to  high-level 
event  recognition  during  the  past  few  years,  the  current  recog¬ 
nition  accuracy  for  many  events  is  still  far  from  satisfac¬ 
tory.  In  this  section,  we  discuss  several  promising  research 
directions  that  may  improve  event  recognition  performance 
significantly. 

Better low-level features There have been numerous
works focusing on the design of low-level features.
Representative ones like SIFT [77] and STIP [65] have already greatly
improved recognition accuracy, compared with the
traditional global features like color and texture. However, it is
clear from the results of the latest systems that these
state-of-the-art low-level features are still insufficient for
representing complex video events. Such handcrafted features,
particularly the gradient-based ones (e.g., SIFT, HOG [27], and
variants), are already reaching their limit in image and video
processing. Thus, the community needs good alternatives that
can better capture key characteristics of video events.

In  place  of  the  handcrafted  static  or  spatial-temporal  local 
features,  a  few  recent  works  which  exploited  deep  learning 
methods  to  automatically  learn  hierarchical  representations 
[69, 135, 146]  open  up  a  new  direction  that  deserves  further 
studies.  These  automatically  learned  features  already  show 
similar  or  even  better  performance  than  the  handcrafted  ones 
on  popular  benchmarks.  In  addition  to  the  visual  features, 
another  factor  that  should  never  be  neglected  is  the  audio 
track  of  videos,  which  is  very  useful  as  discussed  earlier  in 
this  paper.  Since  audio  and  vision  were  mostly  separately 
investigated  in  two  different  communities,  limited  research 
(except  [51,168])  has  been  done  on  how  audio-visual  cues 
can  be  jointly  used  to  represent  video  events  (cf.  Sect.  2.4). 
The  importance  of  this  problem  needs  to  be  highlighted  to 
attract  more  research  attention.  We  believe  that  good  joint 
audio-visual  representations  may  lead  to  a  big  leap  in  video 
event  recognition  accuracy. 

Beyond BoW + SVM Most of the currently well-performing
event recognition systems rely on a simple pipeline that
uses BoW representations of various visual descriptors and
SVM classification. Although this approach, with years of
study in optimizing the design details, has the highest
accuracy to date, the room for further improvement is very limited.
Thus a natural question arises: are there any more
promising alternative solutions? While the exact solution
may be unclear, the answer to the question is quite positive.
There has been a recent surge in neural network research on
improving the accuracy of bag-of-words based
representations [23, 147]. These approaches show promising
improvements in document classification over regular bag-of-words
based approaches and hence are expected to improve event
detection using the conventional bag-of-words representation.
Another interesting direction is to explore solutions that use
prior knowledge, an intuitively very helpful resource that
has been almost fully ignored in the current BoW + SVM
pipeline. Just as external knowledge is always important for
human perception, we believe it is also critical
for the design of a robust automatic event recognition
system. Although current knowledge-based models have not
yet shown promising results, this direction deserves more
investigation.

Event  context  and  attributes  Complex  events  can  be 
generally  decomposed  into  a  set  or  sequence  of  concepts 
(actions,  scenes,  objects,  audio  sounds,  etc.),  which  are  rela¬ 
tively  easier  to  be  recognized  since  they  have  much  smaller 
semantic  granularity  and  thus  are  visually  or  acoustically 
more  distinct.  Once  we  have  a  large  number  of  contextual 
concept  detectors,  the  detection  results  can  be  applied  to  infer 
the existence of an event. As discussed earlier in Sect. 3.1.2,
several works have explored this direction, albeit with
very straightforward modeling methods. In computer
vision, a similar line of research, namely attribute-based
methods, has also emerged recently for various visual recognition
tasks. A few questions still need to be addressed:
Should one manually specify concepts or attributes
(supervised learning), or automatically discover them from
an existing vocabulary (unsupervised learning)? How many
concepts should be adopted, and which ones? Is there a universal
vocabulary of concepts that can be used for applications in
any domain? How can these concepts be detected reliably, and how
should events be modeled based on them? Each of these problems
requires serious and in-depth investigation. This may
look like a difficult direction. However, once these problems
are tackled, recognizing complex events should become
much more tractable.
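
As an illustration of how concept detection results could be applied to infer the existence of an event, the following minimal Python sketch scores an event as a weighted average of concept detector confidences. It is only one very simple modeling choice, not a method taken from the works surveyed here. The function name, concept names, weights, and detector scores are all hypothetical.

```python
from typing import Dict


def event_score(concept_scores: Dict[str, float],
                event_weights: Dict[str, float]) -> float:
    """Weighted average of concept detector confidences for one event.

    Concepts missing from concept_scores are treated as not detected (0.0).
    """
    total_weight = sum(event_weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * concept_scores.get(concept, 0.0)
                   for concept, weight in event_weights.items())
    return weighted / total_weight


# Hypothetical relevance weights of four concepts for a "birthday party" event.
BIRTHDAY_PARTY = {"cake": 0.4, "candles": 0.3, "singing": 0.2, "indoor": 0.1}

# Hypothetical detector confidences for two videos.
video_a = {"cake": 0.9, "candles": 0.8, "singing": 0.7, "indoor": 0.6}
video_b = {"car": 0.9, "road": 0.8, "indoor": 0.1}

score_a = event_score(video_a, BIRTHDAY_PARTY)
score_b = event_score(video_b, BIRTHDAY_PARTY)
print(score_a, score_b)
```

In this toy setting the video whose detected concepts match the event receives the higher score. The open questions listed above (which concepts to use, how to detect them reliably, and how to model their relationship to events) correspond to the choice of vocabulary, the quality of the detector scores, and the form of the scoring function.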

Ad  hoc  event  detection  Ad  hoc  event  detection  refers  to 
the  cases  where  very  few  examples  are  available  and  the  sys¬ 
tem  does  not  have  any  prior  knowledge  about  a  target  event. 
Techniques for ad hoc event detection are needed in retrieval-like
scenarios, where users supply one or a few examples of
an event of interest to retrieve relevant videos in a limited
amount of time. Such problems are often termed one-shot
or few-shot learning. Knowledge-based solutions
are incapable of performing this task since the event is
not known a priori. The performance of supervised learning
classifiers is questionable due to the small number of training
samples. To this end, one can leverage knowledge from
text to derive semantic similarity between annotated and
undiscovered concepts, which can lead to the discovery of new
concepts for which no training data are available [6]. The
idea of semantic similarity can be extended to different levels
of the event hierarchy to detect concepts with previously
unseen exemplars, or complex events with no training exemplars.
Following the discussion on event context, once
videos are indexed offline with a large number of concepts,
online retrieval or detection of unknown events becomes possible,
since videos of the same event are very likely to contain
similar concept occurrence distributions. In other words,
event  detection  can  be  achieved  by  measuring  the  similarity 
of  the  concept  occurrence  vectors  between  query  examples 
and  database  videos.  This  converts  the  ad  hoc  event  detection 
task  into  a  nearest  neighbor  search  problem,  to  which  highly 
efficient  hashing  techniques  [152, 156]  or  indexing  methods 
[126]  may  be  applied  to  achieve  real-time  retrieval  in  large 
databases. 
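
The retrieval view described above can be sketched as follows: a minimal Python example that ranks database videos by the cosine similarity between their concept occurrence vectors and that of a query example. Cosine similarity is an assumed choice of similarity measure, and the exhaustive linear scan is only a stand-in for the hashing or indexing structures a large-scale system would use. The function names, video identifiers, concept order, and vector values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(query: Sequence[float],
                database: Dict[str, Sequence[float]],
                top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (video_id, similarity) pairs, most similar first."""
    scored = [(video_id, cosine_similarity(query, vector))
              for video_id, vector in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical concept order: [cake, candles, singing, car, road].
database = {
    "video_1": [0.9, 0.8, 0.7, 0.0, 0.1],
    "video_2": [0.0, 0.1, 0.0, 0.9, 0.8],
    "video_3": [0.7, 0.9, 0.5, 0.1, 0.0],
}

# Concept occurrence vector of a single query example supplied by the user.
query = [0.8, 0.9, 0.6, 0.0, 0.0]

results = rank_videos(query, database, top_k=2)
print(results)
```

Because no event-specific classifier is trained, a previously unseen event can be handled as soon as one query example has been indexed with the same concept detectors as the database videos.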

Better event recounting Very little work has been
done on event recounting, although this capability is needed
by  many  applications  as  discussed  earlier.  Precise  video  event 
recounting  is  a  challenging  problem  that  demands  not  only 
highly  accurate  content  recognition  but  also  good  NLP  mod¬ 
els  to  make  the  final  descriptions  as  natural  as  possible. 
Recognizing a large number of concepts (organized in a hierarchy)
is certainly a good direction to pursue, where an interesting
sub-problem is “how to rectify false detections based
on contextual relationships (e.g., co-occurrence, causality,
etc.) that exist across concepts?” To generate good descriptions,
purely analyzing video content may not be sufficient,
since automatic techniques still have a long way to go
before they reach human capability. To narrow this gap, the
rich  information  on  the  Web  may  be  a  good  complement  since 
surrounding  texts  of  visually  similar  videos  may  be  exploited 
to  recount  a  target  video,  even  when  the  semantic  content  of 
the  video  cannot  be  perfectly  recognized. 

Better  benchmark  datasets  The  TRECVID  MED  task 
has  set  up  a  good  benchmark  for  video  event  recognition. 
However, the number of events is currently still limited to
10-20, far fewer than the number of events
that may actually appear in videos. On the one hand, this prevents the
exploration of techniques that utilize the co-occurrence or
causality between multiple events in a video. On the other
hand, conclusions drawn from a small set of events may not
generalize well. Therefore, there is a need to construct benchmark
datasets covering a larger number of events. In addition,
there are still no well-defined datasets for event recounting.
To advance technical development in this direction, good
datasets are needed.

7  Conclusions 

We have presented a comprehensive survey of techniques
for high-level event recognition in unconstrained videos.
We have reviewed several important topics, including static
frame-based features, spatio-temporal features, acoustic
features, audio-visual joint representations, bag-of-features,
kernel classifiers, graphical models, knowledge-based techniques,
and fusion techniques. We also discussed several
issues  that  emerged  because  of  particular  application  require¬ 
ments,  such  as  event  localization  and  recounting,  as  well 
as  scalability  and  efficiency.  Moreover,  we  described  pop¬ 
ular  benchmarks  and  evaluation  criteria,  and  summarized 
key  components  of  systems  that  achieved  top  performance  in 
recent  TRECVID  evaluations.  With  a  few  promising  direc¬ 
tions  for  future  research  given  at  the  end,  we  believe  that  this 
paper  can  provide  valuable  insights  for  current  researchers 
in  the  field  and  useful  guidance  for  new  researchers  who  are 
just  beginning  to  explore  this  topic. 